
SPEECH SYNTHESIS SYSTEM

The present invention relates to speech synthesis systems, and in
particular to speech coding and synthesis systems which can be used in
speech communication systems operating at low bit rates.

Speech can be represented as a waveform the detailed structure of which 
represents the characteristics of the vocal tract and vocal excitation of the person 
producing the speech. If a speech communication system is to be capable of
providing an adequate perceived quality, the transmitted information must be
capable of representing that detailed structure. Most of the power in voiced
speech is at relatively low frequencies, for example below 2kHz. Accordingly
good quality speech synthesis can be achieved on the basis of speech waveforms
that have been low pass filtered to reject higher frequency components. The
perceived speech quality is however adversely affected if the frequency is
restricted much below 4kHz.

Many models have been suggested for defining the characteristics of
speech. The known models rely upon dividing a speech signal into blocks or 
frames and deriving parameters to represent the characteristics of the speech 
within each frame. Those parameters are then quantized and transmitted to a 
receiver. At the receiver the quantization process is reversed to recover the 
parameters, and a speech signal is then synthesised on the basis of the recovered 
parameters.

The common objective of the designers of the known models is to
minimise the volume of data which must be transmitted whilst maximising the
perceived quality of the speech that can be synthesised from the transmitted
data. In some of the models a distinction is made as to whether a
particular frame is "voiced" or "unvoiced". In the case of voiced speech, speech
is produced by glottal excitation and as a result has a quasi-periodic structure.
Unvoiced speech is produced by turbulent air flow at a constriction and does not
have the "periodic" spectral structure characteristic of voiced speech. Most
models seek to take advantage of the fact that voiced speech signals evolve
relatively slowly in the context of frames the duration of which is typically 10 to
30msecs. Most models also rely upon quantization schemes intended to minimise
the amount of information which must be transmitted without significant loss of
perceived quality. As a result of the work done to date it is now possible to
produce speech synthesis systems capable of operating at a bit rate of only a few
thousand bits per second.

One model which has been developed is known as "sinusoidal coding"
(R.J. McAulay and T.F. Quatieri, "Low Rate Speech Coding Based on Sinusoidal
Coding", Advances in Speech Signal Processing, Editors S. Furui and M.M.
Sondhi, Chapter 6, pp. 165-208, Marcel Dekker, New York, 1992). This
approach relies upon an FFT analysis of each input frame to produce a
magnitude spectrum, estimating the pitch period of the input frame from that
spectrum, and defining the amplitudes at the pitch related harmonics, the
harmonics being multiples of the fundamental frequency of the frame. An error
measure is calculated in the time domain representing the difference between
harmonic and aharmonic speech spectra and that error measure is used to define
the degree of voicing of the input frame in terms of a frequency value. Thus the
parameters used to represent a frame are the pitch period, the magnitude and
phase values for each harmonic, and the frequency value. Proposals have been
made to operate this system such that phase information is predicted in a 
coherent way across successive frames. 

In another system known as "multiband excitation coding" (D.W. Griffin
and J.S. Lim, "Multiband Excitation Vocoder", IEEE Transactions on Acoustics,
Speech and Signal Processing, vol. 36, pp. 1223-1235, 1988 and Digital Voice
Systems Inc, "INMARSAT M Voice Codec, Version 3.0", Voice Coding System
Description, Module 1, Appendix 1, August 1991) the amplitude and phase
functions are determined in a different way from that employed in sinusoidal
coding. The emphasis in this system is placed on dividing a spectrum into bands,
for example up to twelve bands, and evaluating the voiced/unvoiced nature of
each of these bands. Bands that are classified as unvoiced are synthesised using
random signals. Where the difference between the pitch estimates of successive 
frames is relatively small, linear interpolation is used to define the required 
amplitudes. The phase function is also defined using linear frequency 
interpolation but in addition includes a constant displacement which is a random 
variable and which depends on the number of unvoiced bands present in the
short term spectrum of the input signal. The system works in a way to preserve
phase continuity between successive frames. When the pitch estimates of
successive frames are significantly different, a weighted summation of signals
produced from amplitudes and phases derived for successive frames is formed to
produce the synthesised signal.

Thus the common ground between the sinusoidal and multiband systems
referred to above is that both schemes directly model the input speech signal
which is DFT analysed, and both systems are at least partially based on the same
fundamental relationship for representing speech to be synthesised. The systems
differ however in terms of the way in which amplitudes and phase are estimated
and quantized, the way in which different interpolation methods are used to
define the necessary phase relationships, and the way in which "randomness" is 
introduced in the recovered speech. 

Various versions of the multiband excitation coding system have been
proposed, for example: an enhanced multiband excitation speech coder (A. Das
and A. Gersho, "Variable Dimension Spectral Coding of Speech at 2400 bps and
Below with Phonetic Classification", IEEE Proc. ICASSP-95, pp. 492-495, May
1995) in which input frames are classified into four types, that is noise, unvoiced,
fully voiced and mixed voiced, and a variable dimension vector quantization
process for spectral magnitude is introduced; the bi-harmonic spectral modelling
system (C. Garcia-Matteo, J.L. Alba-Castro and Eduardo R. Banga, "Speech
Coding Using Bi-Harmonic Spectral Modelling", Proc. EUSIPCO-94,
Edinburgh, Vol. 2, pp. 391-394, September 1994) in which the short term
magnitude spectrum is divided into two bands and a separate pitch frequency is
calculated for each band; the spectral excitation coding system (V. Cuperman, P.
Lupini and B. Bhattacharya, "Spectral Excitation Coding of Speech at 2.4 kb/s",
IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) which applies
sinusoidal based coding in the linear predictive coding (LPC) residual domain,
where the synthesised residual signal is the summation of pitch harmonic
oscillators with appropriate amplitude and phase functions and amplitudes are
quantized using a non-square transformation; the band-widened harmonic
vocoder (G. Yang, G. Zanellato and H. Leich, "Band Widened Harmonic Vocoder
at 2 to 4 kbps", IEEE Proc. ICASSP-95, pp. 504-507, Detroit, May 1995) in which
randomness in the signal is introduced by adding jitter to the amplitude
information on a per band basis; pitch synchronous multiband coding (H. Yang,
S.N. Koh and P. Sivaprakasapillai, "Pitch Synchronous Multi-Band (PSMB)
Speech Coding", IEEE Proc. ICASSP-95, pp. 516-519, Detroit, May 1995) in
which a CELP (code-excited linear prediction) based coding scheme is used to
encode speech period segments; multiband LPC coding (S. Yeldener, M. Kondoz
and G. Evans, "High Quality Multiband LPC Coding of Speech at 2.4 kbits/s",
Electronics Letters, pp. 1287-1289, Vol. 27, No 14, 4th July 1991) in which a single
amplitude value is allocated to each frame to in effect specify a "flat" residual
spectrum; and harmonic and noise coding (M. Nishiguchi and J. Matsumoto,
"Harmonic and Noise Coding of LPC Residuals with Classified Vector
Quantisation", IEEE Proc. ICASSP-95, pp. 484-487, Detroit, May 1995) with
classified vector quantization which operates in the LPC residual domain, an
input signal being classified as voiced or unvoiced and being full band modelled.

A further type of coding system exists, that is the prototype interpolation 
coding system. This relies upon the use of pitch period segments or prototypes 
which are spaced apart in time and reiteration/interpolation techniques to 
synthesise the signal between two prototypes. Such a system was described as 
early as 1971 (J.S. Severwight, "Interpolation Reiterations Techniques for
Efficient Speech Transmission", Ph.D. Thesis, Loughborough University,
Department of Electrical Engineering, 1971). More sophisticated systems of the
same general class have been described more recently, for example in the paper
by W.B. Kleijn, "Continuous Representations in Linear Predictive Coding", Proc.
ICASSP-91, pp. 201-204, May 1991. The same author has published a series of
related papers. The system employs 20msecs coding frames which are classified
as voiced or unvoiced. Unvoiced frames are effectively CELP coded. Pitch 
prototype segments are defined in adjacent voiced frames, in the LPC residual 
signal, in a way which ensures maximum alignment (correlation) of the 
prototypes and defines the prototype so that the main pitch excitation pulse is 
not near to either of the ends of the prototype. A pitch period in a given frame is 
considered to be a cycle of an artificial periodic signal from which the prototype 
for the frame is obtained. The prototypes which have been appropriately
selected from adjacent frames are Fourier transformed and the resulting
coefficients are coded using a differential vector quantization scheme.

With this scheme, during synthesis of voiced frames, the decoded 
prototype Fourier representations for adjacent frames are used to reconstruct 
the missing signal waveform between the two prototype segments using linear 
interpolation. Thus the residual signal is obtained which is then presented to an 
LPC synthesis filter the output of which provides the synthesised voiced speech
signal. An amount of randomness can be introduced into voiced speech by
injecting noise at frequencies larger than 2kHz, the amplitude of the noise
increasing with frequency. In addition, the periodicity of synthesised voiced
speech is controlled during the quantization of prototype parameters in
accordance with a long term signal to change ratio measure that reflects the 
similarity which exists between the prototypes of adjacent frames in the residual 
excitation signal. 

The known prototype interpolation coding systems rely upon a Fourier
series synthesis equation which involves a linear-with-time interpolation process.
The assumption is that the pitch estimates for successive frames are linearly
interpolated to provide a pitch function and an associated instantaneous
fundamental frequency. The instantaneous phase used in the cosine and sine
terms of the Fourier series synthesis equation is the integral of the
instantaneous harmonic frequencies. This synthesis arrangement allows for the
linear evolution of the



WO 98/01848 PCT/GB97/01831



instantaneous pitch and the non-linear evolution of the instantaneous harmonic 
frequencies. 
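
By way of illustration only, the synthesis arrangement just described can be sketched as follows. The function name, the 8kHz sampling rate and the use of a running sum as the discrete integral are assumptions made for this sketch and are not taken from the cited systems.

```python
import math

def synthesise_interval(p_start, p_end, amplitudes, n_samples, fs=8000.0):
    """Synthesise one interpolation interval as a sum of cosine harmonics.

    The pitch period (in samples) is interpolated linearly with time. The
    phase of harmonic k is k times the running sum (a discrete integral) of
    the instantaneous fundamental frequency.
    """
    out = []
    phase = 0.0  # phase of the fundamental, in radians
    for n in range(n_samples):
        pitch = p_start + (p_end - p_start) * n / n_samples  # linear in time
        f0 = fs / pitch                     # instantaneous fundamental, Hz
        phase += 2.0 * math.pi * f0 / fs    # integrate frequency into phase
        out.append(sum(a * math.cos((k + 1) * phase)
                       for k, a in enumerate(amplitudes)))
    return out
```

Because the fundamental frequency is the reciprocal of the interpolated pitch period, the pitch evolves linearly while the harmonic frequencies do not.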

A development of this system is described by W.B. Kleijn and J. Haagen,
"A Speech Coder Based on Decomposition of Characteristic Waveforms", Proc.
ICASSP-95, pp. 508-511, Detroit, May 1995. In the described system the Fourier
series coefficients are low pass filtered over time, with a cut-off frequency of
20Hz, to provide a "slowly evolving" waveform component for the LPC
excitation signal. The difference between this low pass component and the
original parameters provides the "rapidly evolving" components of the excitation
signal. Periodic voiced excitation signals are mainly represented by the "slowly
evolving" component, whereas random unvoiced excitation signals are
represented by the "rapidly evolving" component in this dual decomposition of
the Fourier series coefficients. This effectively removes the need for treating
voiced and unvoiced frames separately. Furthermore, the rate of quantization
and transmission of the two components is different. The "slowly evolving"
signal is sampled at relatively long intervals of 25msecs, but the parameters are
quantized quite accurately on the basis of spectral magnitude information. In
contrast, the spectral magnitude of the "rapidly evolving" signal is sampled
frequently, every 4msecs, but is quantized less accurately. Phase information is
randomised every 2msecs.
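
The dual decomposition described above can be sketched, under simplifying assumptions, as follows. A moving average stands in here for the 20Hz low pass filter of the cited system; the function name and the window length are hypothetical.

```python
def decompose(coeff_track, window=5):
    """Split the trajectory over time of one Fourier series coefficient.

    The slowly evolving part is a moving average (a crude low pass filter,
    assumed here in place of the 20Hz filter). The rapidly evolving part is
    the difference between the original trajectory and the slow part.
    """
    half = window // 2
    slow = []
    for i in range(len(coeff_track)):
        lo, hi = max(0, i - half), min(len(coeff_track), i + half + 1)
        slow.append(sum(coeff_track[lo:hi]) / (hi - lo))
    rapid = [c - s for c, s in zip(coeff_track, slow)]
    return slow, rapid
```

A constant trajectory (purely periodic excitation) yields a zero rapidly evolving part, whereas a trajectory that fluctuates from sample to sample is carried mostly by the rapidly evolving part.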

Other developments of the prototype interpolation coding system have 
been proposed. For example one known system operates on 5msec frames, a
pitch period being selected for voiced frames and DFT transformed to yield
prototype spectral magnitude values. These values are quantized and the
quantized values for adjacent frames are linearly interpolated. Phase
information is defined in a manner which does not satisfy any frequency
restrictions at the interpolation boundaries. This causes problems of
discontinuity at frame boundaries. At the receiver the excitation signal is
synthesised using a decoded magnitude and estimated phase values, via an
inverse DFT process. The resulting signal is filtered by a following LPC
synthesis filter. This model is purely periodic during voiced speech, and this is 
why a very short duration frame is used. Unvoiced speech is CELP coded. 

The wide range of speech synthesis models currently being proposed, only
some of which are described above, and the range of alternative approaches
proposed to implement those models, indicate the interest in such systems and
the lack of any consensus as to which system provides the most advantageous
performance.

It is an object of the present invention to provide an improved low bit rate 
speech synthesis system. 

In known systems in which it is necessary to obtain an estimate of the
pitch of a frame of a speech signal, it has been thought necessary, if high quality
of synthesised speech is to be achieved, to obtain high resolution non-integer
pitch period estimates. This requires complex processes, and it would be highly
desirable to reduce the complexity of the pitch estimation process in a manner
which did not result in degraded quality.

According to a first aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of
frames, and each frame is converted into a coded signal including a
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered
speech segment centred about a reference sample is defined in each frame, a
correlation value is calculated for each of a series of candidate pitch estimates as
the maximum of multiple crosscorrelation values obtained from variable length
speech segments centred about the reference sample, the correlation values are
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate.
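
A minimal sketch of how such a correlation function might be formed is given below. It assumes the input has already been low pass filtered, and the way the two segments are positioned about the reference sample, the normalisation and all names are assumptions made for illustration only.

```python
import math

def normalised_xcorr(a, b):
    """Normalised cross-correlation of two equal length segments."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

def correlation_function(x, ref, candidates, lengths):
    """For each candidate pitch d, keep the largest cross-correlation
    found over several segment lengths centred about sample `ref`."""
    cf = {}
    for d in candidates:
        best = -1.0
        for length in lengths:
            start = ref - (length + d) // 2   # keep the pair centred on ref
            if start < 0 or start + d + length > len(x):
                continue                      # segment would leave the buffer
            a = x[start:start + length]
            b = x[start + d:start + d + length]
            best = max(best, normalised_xcorr(a, b))
        cf[d] = best
    return cf
```

For a signal with a period of 40 samples the function returned by this sketch peaks at candidate values of 40 and its multiples, which is the property the subsequent peak location step relies upon.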

The result of the above system is that an integer pitch period value is 
obtained. The system avoids undue complexity and may be readily implemented.

Preferably the pitch estimate is defined using an iterative process. A 
single reference sample may be used, for example centred with respect to the 
respective frame, or alternatively multiple pitch estimates may be derived for 
each frame using different reference samples, the multiple pitch estimates being 
combined to define a combined pitch estimate for the frame. The pitch estimate 
may be modified by reference to a voiced/unvoiced status and/or pitch estimates 
of adjacent frames to define a final pitch estimate.

The correlation function may be clipped using a threshold value,
remaining peaks being rejected if they are adjacent to larger peaks. Peaks are
initially selected and can be rejected if they are smaller than a following peak by
more than a predetermined factor, for example smaller than 0.9 times the
following peak.

Preferably the pitch estimation procedure is based on a least squares
error algorithm. Preferably the algorithm defines the pitch as a number whose
multiples best fit the correlation function peak locations. Initial possible pitch
values may be limited to integral numbers which are not consecutive, the
increment between two successive numbers being proportional to a constant
multiplied by the lower of those two numbers.
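
The least squares procedure can be sketched as follows. The constant 0.05, the function names and the rule of preferring the larger candidate when two candidates give the same error (which resolves the ambiguity between a pitch value and its sub-multiples) are assumptions of this sketch.

```python
def candidate_pitches(p_min, p_max, step_factor=0.05):
    """Integer candidates whose spacing grows with the pitch value."""
    p, out = p_min, []
    while p <= p_max:
        out.append(p)
        p += max(1, int(step_factor * p))   # increment scales with lower value
    return out

def least_squares_pitch(peak_locations, candidates):
    """Pick the candidate whose integer multiples lie closest, in the
    squared error sense, to the correlation function peak locations."""
    def error(p):
        total = 0.0
        for loc in peak_locations:
            k = max(1, round(loc / p))      # index of the nearest multiple
            total += (loc - k * p) ** 2
        return total
    # Ties are broken in favour of the larger candidate (assumption).
    return min(candidates, key=lambda p: (error(p), -p))
```

For example, with peaks located at 41, 80 and 121 samples, the candidates 20 and 40 both give a squared error of 2 and the sketch selects 40.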

It is well known from the prior art to classify individual frames as voiced 
or unvoiced and to process those frames in accordance with that classification. 
Unfortunately such a simple classification process does not accurately reflect the
true characteristics of speech. It is often the case that individual frames are 
made up of both periodic (voiced) and aperiodic (unvoiced) components. Prior 
attempts to address this problem have not proved particularly effective. 

It is an object of the present invention to provide an improved voiced or 
unvoiced classification system. 

According to a second aspect of the present invention there is provided a 
speech synthesis system in which a speech signal is divided into a series of 
frames, and each frame is converted into a coded signal including pitch segment
magnitude spectral information, a voiced/unvoiced classification, and a mixed
voiced classification which classifies harmonics in the magnitude spectrum of
voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is
calculated and used to clip the magnitude spectrum, the clipped data is searched
to define peaks, the locations of peaks are determined, constraints are applied to
define dominant peaks, and harmonics not associated with a dominant peak are
classified as weakly voiced.

Peaks may be located using a second order polynomial. The samples may
be Hamming windowed. The threshold value may be calculated by identifying
the maximum and minimum magnitude spectrum values and defining the
threshold as a constant multiplied by the difference between the maximum and
minimum values. Peaks may be defined as those values which are greater than
the two adjacent values. A peak may be rejected from consideration if
neighbouring peaks are of a similar magnitude, e.g. more than 80% of the
magnitude, or if there are spectral magnitudes in the same range of greater
magnitude. A harmonic may be considered as not being associated with a
dominant peak if the difference between two adjacent peaks is greater than a
predetermined threshold value.
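
The windowing, clipping and peak location steps may be sketched as below. The constant 0.1 and the function names are hypothetical, a direct DFT is used for brevity, and the further constraints which define dominant peaks are not implemented in this sketch.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Hamming window the samples and return DFT magnitudes up to Nyquist."""
    n = len(samples)
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * wi for s, wi in zip(samples, w)]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def find_peaks(mag, factor=0.1):
    """Clip at factor * (max - min), keep bins greater than both adjacent
    bins, and refine each location with a second order polynomial fit."""
    threshold = factor * (max(mag) - min(mag))
    peaks = []
    for k in range(1, len(mag) - 1):
        a, b, c = mag[k - 1], mag[k], mag[k + 1]
        if b > threshold and b > a and b > c:
            offset = 0.5 * (a - c) / (a - 2.0 * b + c)  # parabola vertex
            peaks.append((k + offset, b))
    return peaks
```

Each returned pair holds the interpolated peak location, in DFT bins, and the magnitude of the bin at which the peak was detected.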

The spectrum may be divided into bands of fixed width and a
strongly/weakly voiced classification assigned for each band. Alternatively the
frequency range may be divided into two or more bands of variable width,
adjacent bands being separated at a frequency selected by reference to the 
strongly/weakly voiced classification of harmonics. 

Thus, the spectrum may be divided into fixed bands, for example fixed
bands each of 500Hz, or variable width bands selected in dependence upon the
strongly/weakly voiced status of harmonic components of the excitation signal. A
strongly/weakly voiced classification is then assigned to each band. The lowest
frequency band, e.g. 0-500Hz, may always be regarded as strongly voiced,
whereas the highest frequency band, for example 3500Hz to 4000Hz, may always
be regarded as weakly voiced. In the event that a current frame is voiced, and
the previous frame is unvoiced, other bands within the current frame, e.g.
3000Hz to 3500Hz, may be automatically classified as weakly voiced. Generally
the strongly/weakly voiced classification may be determined using a majority
decision rule on the strongly/weakly voiced classification of those harmonics
which fall within the band in question. If there is no majority, alternate bands
may be alternately assigned strongly voiced and weakly voiced classifications.
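
A minimal sketch of the fixed band majority decision rule follows. The handling of a band with no majority (taking the opposite of the preceding band) is one possible reading of the alternating assignment and is an assumption, as are the function and parameter names.

```python
def classify_bands(harmonic_freqs, harmonic_strong,
                   band_width=500.0, fs=8000.0):
    """Return one flag per band: True for strongly voiced, False for weakly
    voiced, decided by majority vote over the harmonics in the band."""
    n_bands = int((fs / 2.0) // band_width)
    flags = []
    for b in range(n_bands):
        lo, hi = b * band_width, (b + 1) * band_width
        votes = [s for f, s in zip(harmonic_freqs, harmonic_strong)
                 if lo <= f < hi]
        strong, weak = votes.count(True), votes.count(False)
        if strong != weak:
            flags.append(strong > weak)
        else:
            # No majority: alternate with respect to the preceding band.
            flags.append(not flags[-1] if flags else True)
    flags[0] = True      # lowest band always regarded as strongly voiced
    flags[-1] = False    # highest band always regarded as weakly voiced
    return flags
```

With a 200Hz fundamental whose harmonics are strongly voiced below 2000Hz, the sketch marks the four bands up to 2000Hz as strongly voiced and the four bands above as weakly voiced.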

Given the classification of a voiced frame such that harmonics are 
classified as either strongly or weakly voiced, it is necessary to generate an 
excitation signal to recover the speech signal which takes into account this 
classification. It is an object of the present invention to provide such a system. 

According to a third aspect of the present invention, there is provided a 
speech synthesis system in which a speech signal is divided into a series of
frames, each frame is defined as voiced or unvoiced, each frame is converted into
a coded signal including a pitch period value, a frame voiced/unvoiced 
classification and, for each voiced frame, a mixed voiced spectral band 
classification which classifies harmonics within spectral bands as either strongly 
or weakly voiced, and the speech signal is reconstructed by generating an 
excitation signal in respect of each frame and applying the excitation signal to a 
filter, wherein for each weakly voiced spectral band, an excitation signal is 
generated which includes a random component in the form of a function which is 
dependent upon the respective pitch period value.

Thus for each frame which has a spectral band that is classified as weakly 
voiced, the excitation signal is represented by a function which includes a first 
harmonic frequency component, the frequency of which is dependent upon the
pitch period value appropriate to that frame, and a second random component 
which is superimposed upon the first component. 

The random component may be introduced by reducing the amplitude of 
harmonic oscillators assigned the weakly voiced classification, for example by 
reducing the power of the harmonics by 50%, while disturbing the oscillator
frequencies, for example by shifting the oscillators randomly in frequency in the 
range of 0 to 30 Hz such that the frequency is no longer a multiple of the 
fundamental frequency, and then adding further random signals. The phase of 
the oscillators producing random signals may be randomised at pitch intervals.
Thus for a weakly voiced band, some periodicity remains but the power of the
periodic component is reduced and then combined with a random component.
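
One way in which such an excitation signal might be generated is sketched below. The form of the added random signal (an oscillator at the harmonic frequency whose phase is redrawn once per pitch period), the integer pitch period, the 8kHz sampling rate and all names are assumptions made for illustration.

```python
import math
import random

def excitation(pitch_period, amplitudes, weak, n_samples, fs=8000.0, rng=None):
    """Sum of harmonic oscillators. Weakly voiced harmonics get half the
    power, a random frequency shift of 0 to 30 Hz, and an added random
    oscillator whose phase is redrawn at pitch intervals."""
    rng = rng or random.Random(0)
    f0 = fs / pitch_period
    out = [0.0] * n_samples
    for k, (amp, is_weak) in enumerate(zip(amplitudes, weak), start=1):
        freq = k * f0
        if is_weak:
            amp *= math.sqrt(0.5)            # 50% power reduction
            freq += rng.uniform(0.0, 30.0)   # no longer a multiple of f0
        noise_phase = 0.0
        for n in range(n_samples):
            out[n] += amp * math.cos(2.0 * math.pi * freq * n / fs)
            if is_weak:
                if n % pitch_period == 0:    # randomise at pitch intervals
                    noise_phase = rng.uniform(0.0, 2.0 * math.pi)
                out[n] += amp * math.cos(
                    2.0 * math.pi * k * f0 * n / fs + noise_phase)
    return out
```

Strongly voiced harmonics are left as pure multiples of the fundamental, so a frame with no weakly voiced band remains strictly periodic in this sketch.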

In a speech synthesis system in which a speech signal is represented in 
part by spectral information in the form of harmonic magnitude values, it is
possible to process an input speech signal to produce a series of spectral
magnitude values and then to use all of those magnitude values at harmonic
locations in subsequent processing steps. In many circumstances however at
least some of the magnitude values contain little information which is useful in
the recovery of the input speech signal. Accordingly when magnitude values are
quantized for transmission to a receiver it is sensible to discard magnitude values 
which contain little useful information. 

In one known system an input speech signal is processed to produce an 
LPC residual signal which in turn is processed to provide harmonic magnitude 
values, but only a fixed number of those magnitude values is vector quantized for 
transmission to a receiver. The discarded magnitude values are represented at 
the receiver as identical constant values. This known system reduces 
redundancy but is inflexible in that the locations of the fixed number of 
magnitude values to be quantized are always the same and predetermined on the 
basis of assumptions that may be inappropriate in particular circumstances.

It is an object of the present invention to provide an improved magnitude
value quantization system.

According to a fourth aspect of the present invention, there is provided a
speech synthesis system in which a speech signal is divided into a series of 
frames, and each voiced frame is converted into a coded signal including a pitch 
period value, LPC coefficients, and pitch segment spectral magnitude 
information, wherein the spectral magnitude information is quantized by 
sampling the LPC short term magnitude spectrum at harmonic frequencies, the 
locations of the largest spectral samples are determined to identify which of the 
magnitudes are relatively more important for accurate quantization, and the 
magnitudes so identified are selected and vector quantized.

Thus rather than relying upon a simple location selection strategy of a 
fixed number of magnitude values for quantization and transmission, for 
example the "low part" of the magnitude spectrum, the invention selects only 
those values which make a significant contribution according to the subjectively
important LPC magnitude spectrum, thereby reducing redundancy without 
compromising quality. 
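
The selection step can be sketched as follows, assuming an LPC polynomial of the form A(z) = 1 + a1 z^-1 + ... + ap z^-p and an integer pitch period in samples. The sign convention, the function names and the number of magnitudes selected are assumptions of this sketch.

```python
import cmath
import math

def lpc_magnitude(lpc, omega):
    """Magnitude of 1 / A(e^{j omega}) for coefficients a1..ap in `lpc`."""
    a = 1.0 + sum(c * cmath.exp(-1j * omega * (i + 1))
                  for i, c in enumerate(lpc))
    return 1.0 / abs(a)

def select_harmonics(lpc, pitch_period, n_select):
    """Harmonic numbers (1-based) at which the LPC short term magnitude
    spectrum, sampled at the harmonic frequencies, is largest."""
    n_harm = pitch_period // 2
    env = {k: lpc_magnitude(lpc, 2.0 * math.pi * k / pitch_period)
           for k in range(1, n_harm + 1)}
    return sorted(sorted(env, key=env.get, reverse=True)[:n_select])
```

For an LPC filter with a single resonance at the fifth harmonic of a 40 sample pitch period, selecting three magnitudes returns harmonics 4, 5 and 6, so that the quantizer concentrates on the region around the resonance.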

In one arrangement in accordance with the invention a pitch segment of
Pn LPC residual samples is obtained, where Pn is the pitch period value of the
nth frame, the pitch segment is DFT transformed, the mean value of the
resultant spectral magnitudes is calculated, the mean value is quantized and used
as a normalisation factor for the selected magnitudes, and the resulting
normalised amplitudes are quantized.

Alternatively, the RMS value of the pitch segment is calculated, the RMS
value is quantized and used as a normalisation factor for the selected 
magnitudes, and the resulting normalised amplitudes are quantized. 

At the receiver, the selected magnitudes are recovered, and each of the 
other magnitude values is reproduced as a constant value. 

Interpolation coding systems which employ a pitch-related synthesis 
formula to recover speech generally encounter the problem of coding a variable 
length, pitch dependent spectral amplitude vector. The quantization scheme
referred to above in which only the magnitudes of relatively greater importance
are quantized avoids this problem by quantizing only a fixed number of 
magnitude values and setting the rest of the magnitude values to a constant 
value. Thus at the receiver a fixed length vector can be recovered. Such a 
solution to the problem however may result in a relatively spectrally flat 
excitation model which has limitations in providing high recovered speech 
quality. 

In an ideal world output speech quality would be maximised by 
quantizing the entire shape of the magnitude spectrum, and various approaches 
have been proposed for coding the entire magnitude spectrum. In one approach, 
the spectrum is DFT transformed and coded differentially across successive 
spectra. This and similar coding schemes are rather inefficient however and
operate with relatively high bit rates. The introduction of vector quantization
allowed for the development of sinusoidal and prototype interpolation systems
which operate at lower bit rates, typically around 2.4Kbits/sec.

Two vector quantization methodologies have been reported which 
quantize a variable size input vector with a fixed size code vector. In a first
approach, the input vector is transformed to a fixed size vector which is then 
conventionally vector quantized. An inverse transform of the quantized fixed 
size vector yields the recovered quantized vector. Transformation techniques 
which have been used include linear interpolation, band limited interpolation, all 
pole modelling and non-square transformation. This approach however 
produces an overall distortion which is the summation of the vector quantization 
noise and a component which is introduced by the transformation process. In a 
second known approach, a variable input vector is directly quantized with a 
fixed size code vector. This approach is based on selecting only a limited number
of elements from each codebook vector to form a distortion measure between a
codebook vector and an input vector. Such a quantization approach avoids the 
transformation distortion of the alternative technique mentioned above and 
results in an overall distortion that is equal to the vector quantization noise, but 
this is significant. 

It is an object of the present invention to provide an improved variable 
sized spectral vector quantization scheme. 

According to a fifth aspect of the present invention, there is provided a 
speech synthesis system in which a variable size input vector of coefficients to be
transmitted to a receiver for the reconstruction of a speech signal is vector
quantized using a codebook defined by vectors of fixed size, the codebook vectors
of fixed size are obtained from variable size training vectors and an interpolation
technique which is an integral part of the codebook generation process, codebook
vectors are compared to the variable sized input vector using the interpolation 
process, and an index associated with the codebook entry with the smallest 
difference from the comparison is transmitted, the index being used to address a 
further codebook at the receiver and thereby derive an associated fixed size 
codebook vector, and the interpolation process being used to recover from the 
derived fixed sized codebook vector an approximation of the variable sized input 
vector. 

The invention is applicable in particular to pitch synchronous low bit rate 
coders of the type described in this document and takes advantage of the 
underlying principle of such coders which means that the shape of the magnitude 
spectrum is represented by a relatively small number of equally spaced samples. 

Preferably the interpolation process is linear. For an input vector of 
given dimension, the interpolation process is applied to produce from the 
codebook vectors a set of vectors of that given dimension. A distortion measure 
is then derived to compare the interpolated set of vectors and the input vector 
and the codebook vector which yields the minimum distortion is selected. 
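
The search and recovery steps may be sketched as below, using linear interpolation and a squared error distortion measure. Codebook training is not shown, and the function names and the example codebook are hypothetical.

```python
def resample(vec, size):
    """Linearly interpolate `vec` onto `size` equally spaced points."""
    if size == 1:
        return [vec[0]]
    out = []
    for i in range(size):
        pos = i * (len(vec) - 1) / (size - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * vec[lo] + frac * vec[hi])
    return out

def quantise(input_vec, codebook):
    """Index of the fixed size codebook vector which, once interpolated to
    the dimension of the input vector, gives the least squared error."""
    def distortion(code):
        approx = resample(code, len(input_vec))
        return sum((a - b) ** 2 for a, b in zip(approx, input_vec))
    return min(range(len(codebook)), key=lambda i: distortion(codebook[i]))

def reconstruct(index, codebook, size):
    """Receiver side: look up the code vector and interpolate to `size`."""
    return resample(codebook[index], size)
```

For example, with a codebook of five element vectors containing [0, 1, 2, 3, 4], a three element input vector [0, 2, 4] is matched to that entry, and the receiver recovers [0, 2, 4] from the transmitted index and the known input dimension.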

Preferably the dimension of the input vectors is reduced by taking into
account only the harmonic amplitudes within the input bandwidth range, for
example 0 to 3.4kHz. Preferably the remaining amplitudes, i.e. in the region of
3.4kHz to 4kHz, are set to a constant value. Preferably, the constant value is
equal to the mean value of the quantized amplitudes. 

Amplitude vectors obtained from adjacent residual frames exhibit 
significant amounts of redundancy which can be removed by means of backward 
prediction. The backward prediction may be performed on a harmonic basis 
such that the amplitude value of each harmonic of one frame is predicted from 
the amplitude value of the same harmonic in the previous frame or frames. A 
fixed linear predictor may be incorporated in the system, together with mean
removal and gain shape quantization processes which operate on a resulting 
error magnitude vector. 

Although the above described variable sized vector quantization scheme 
provides advantageous characteristics, and in particular provides for good 
perceived signal quality at a bit rate of for example 2.4Kbits/sec, in some 
environments a lower bit rate would be highly desirable even at the loss of some 
quality. It would be possible for example to rely upon a single value 
representation and quantization strategy on the assumption that the magnitude 
spectrum of the pitch segment in the residual domain has an approximately flat 
shape. Unfortunately systems based on this assumption have a rather poor 
decoded speech quality. 

It is an object of the present invention to overcome the above limitation in 
lower bit rate systems. 

According to a sixth aspect of the present invention, there is provided a 
speech synthesis system in which a speech signal is divided into a series of 
frames, each frame is converted into a coded signal including an estimated pitch 
period, an estimate of the energy of a speech segment the duration of which is a 
function of the estimated pitch period, and LPC filter coefficients defining an 
LPC spectral envelope, and a speech signal of related power to the power of the 
input speech signal is reconstructed by generating an excitation signal using 
spectral amplitudes which are defined from a modified LPC spectral envelope 
sampled at the harmonic frequencies defined by the pitch period. 

Thus, although a single value is used to represent the spectral envelope of 
the excitation signal, the excitation spectral envelope is shaped according to the 
LPC spectral envelope. The result is a system which is capable of delivering high 
quality speech at 1.5Kbits/sec. The invention is based on the observation that 
some of the speech spectrum resonance and anti-resonance information is also 
present in the residual magnitude spectrum, since LPC inverse filtering cannot 
produce a residual signal of absolutely flat magnitude spectrum. As a 
consequence, the LPC residual signal is itself highly intelligible. 

The magnitude values may be obtained by spectrally sampling a modified 
LPC synthesis filter characteristic at the harmonic locations related to the pitch 
period. The modified LPC synthesis filter may have reduced feed back gain and 
a frequency response which consists of equalised resonant peaks, the locations of 
which are close to the LPC synthesis resonant locations. The value of the feed 
back gain may be controlled by the performance of the LPC model such that it is 
for example proportional to the normalised LPC prediction error. The energy of 
the reproduced speech signal may be equal to the energy of the original speech 
waveform. 
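
Sampling an LPC envelope at the pitch harmonics can be sketched as follows. This is a minimal illustration and not the modified filter of the invention: the sign convention A(z) = 1 - sum of a_k z^-k, the use of a bandwidth-expansion factor `gamma` as a stand-in for the reduced feed back gain, and all function names are assumptions.

```python
import cmath
import math


def lpc_envelope(coeffs, omega, gamma=1.0):
    """Magnitude of 1/A(e^{j*omega}) with A(z) = 1 - sum_k gamma^k a_k z^-k.
    gamma < 1 flattens the resonant peaks (assumed stand-in for the
    reduced feed back gain mentioned in the text)."""
    a_of_z = 1.0 + 0.0j
    for k, a_k in enumerate(coeffs, start=1):
        a_of_z -= (gamma ** k) * a_k * cmath.exp(-1j * omega * k)
    return 1.0 / abs(a_of_z)


def harmonic_amplitudes(coeffs, pitch_period, gamma=0.8):
    """Sample the envelope at the harmonics 2*pi*j/P that lie below pi."""
    n_harm = (pitch_period + 1) // 2
    return [
        lpc_envelope(coeffs, 2.0 * math.pi * j / pitch_period, gamma)
        for j in range(1, n_harm)
    ]
```

With `gamma=1.0` the unmodified envelope is sampled; smaller values give a flatter set of amplitudes, which would then be scaled so that the synthesised segment has the transmitted energy.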

It is well known that in prototype interpolation coding speech synthesis 
systems there are often substantial similarities between the prototypes of 
adjacent frames in the residual excitation signals. This has been used in various 
systems to improve perceived speech quality by ensuring that there is a smooth 
evolution of the speech signal over time. 

It is an object of the present invention to provide an improved speech 
synthesis system in which the excitation and vocal tract dynamics are 
substantially preserved in the recovered speech signal. 

According to a seventh aspect of the present invention, there is provided a 
speech synthesis system in which a speech signal is divided into a series of 
frames, each frame is converted into a coded signal including LPC filter 
coefficients and at least one parameter associated with a pitch segment 
magnitude, and the speech signal is reconstructed by generating two excitation 
signals in respect of each frame, each pair of excitation signals comprising a first 
excitation signal generated on the basis of the pitch segment magnitude 
parameter or parameters of one frame and a second excitation signal generated 
on the basis of the pitch segment magnitude parameter or parameters of a 
second frame which follows and is adjacent to the said one frame, applying the 
first excitation signal to a first LPC filter the characteristics of which are 
determined by the LPC filter coefficients of the said one frame and applying the 
second excitation signal to a second LPC filter the characteristics of which are 
determined by the LPC filter coefficients of the said second frame, and weighting 
and combining the outputs of the first and second LPC filters to produce one 
frame of a synthesised speech signal. 

Preferably the first and second excitation signals include the same phase 
function and different phase contributions from the two LPC filters involved in 
the above double synthesis process. This reduces the degree of pitch periodicity 
in the recovered signals. This and the combination of the first and second LPC 
filter outputs ensures an effective smooth evolution of the speech spectral 
envelope on a sample by sample basis. 

Preferably the outputs of the first and second LPC filters are weighted by 
half a window function such as a Hamming window such that the magnitude of 
the output of the first filter is decreasing with time and the magnitude of the 
output of the second filter is increasing with time. 

According to an eighth aspect of the present invention, there is provided a 
speech coding system which operates on a frame by frame basis, and in which 
information is transmitted which represents each frame as either voiced or 
unvoiced and, for each voiced frame, represents that frame by a pitch period 
value, quantized magnitude spectral information, and LPC filter coefficients, the 
received pitch period value and magnitude spectral information being used to 
generate residual signals at the receiver which are applied to LPC speech 
synthesis filters the characteristics of which are determined by the transmitted 
filter coefficients, wherein each residual signal is synthesised according to a 
sinusoidal mixed excitation synthesis process, and a recovered speech signal is 
derived from the residual signals. 

Embodiments of the present invention will now be described, by way of 
example, with reference to the accompanying drawings, in which: 

Figure 1 is a general block diagram of the encoding process in accordance 
with the present invention; 

Figure 2 illustrates the relationship between coding and matrix 
quantisation frames; 

Figure 3 is a general block diagram of the decoding process; 

Figure 4 is a block diagram of the excitation synthesis process; 

Figure 5 is a schematic diagram of the overlap and add process; 

Figure 6 is a schematic diagram of the calculation of an instantaneous 
scaling factor; 

Figure 7 is a block diagram of the overall voiced/unvoiced classification 
and pitch estimation process; 

Figure 8 is a block diagram of the pitch estimation process; 

Figure 9 is a schematic diagram of two speech segments which participate 
in the calculation of a crosscorrelation function value; 

Figure 10 is a schematic diagram of speech segments used in the 
calculation of the crosscorrelation function value; 

Figure 11 represents the value allocated to a parameter used in the 
calculation of the crosscorrelation function value for different delays; 

Figure 12 is a block diagram of the process used for calculating the 
crosscorrelation function and the selection of its peaks; 

Figure 13 is a flow chart of a pitch estimation algorithm; 

Figure 14 is a flow chart of a procedure used in the pitch estimation 
process; 

Figure 15 is a flow chart of a further procedure used in the pitch 
estimation process; 

Figure 16 is a flow chart of a further procedure used in the pitch 
estimation process; 

Figure 17 is a flow chart of a threshold value selection procedure; 

Figure 18 is a flow chart of the voiced/unvoiced classification process; 

Figure 19 is a schematic diagram of the voiced/unvoiced classification 
process with respect to parameters generated during the pitch estimation 
process; 

Figure 20 is a flow chart of the procedure used to determine offset values; 

Figure 21 is a flow chart of the pitch estimation algorithm; 

Figure 22 is a flow chart of a procedure used to impose constraints on 
output pitch estimates to ensure smooth evolution of pitch values with time; 

Figures 23, 24 and 25 represent different portions of a flow chart of a 
pitch postprocessing procedure; 

Figure 26 is a general block diagram of the LPC analysis and LPC 
quantisation process; 

Figure 27 is a general flow chart of a strongly or weakly voiced 
classification process; 

Figure 28 is a flow chart of the procedure responsible for the 
strongly/weakly voiced classification; 

Figure 29 represents a speech waveform obtained from a particular 
speech utterance; 

Figure 30 shows frequency tracks obtained for the speech utterance of 
Figure 29; 

Figure 31 shows to a larger scale a portion of Figure 30 and represents the 
difference between strongly and weakly voiced classifications; 

Figure 32 shows a magnitude spectrum of a particular speech segment 
and the corresponding LPC spectral envelope and the normalised short term 
magnitude spectra of the corresponding residual segment, excitation segment 
obtained using a binary excitation model and an excitation segment obtained 
using the strongly/weakly voiced model; 

Figure 33 is a general block diagram of a system for representing and 
quantising magnitude information; 

Figure 34 is a block diagram of an adaptive quantiser shown in Figure 33; 

Figure 35 is a general block diagram of a quantisation process; 

Figure 36 is a general block diagram of a differential variable size 
spectral vector quantiser; and 

Figure 37 represents the hierarchical structure of a mean gain shape 
quantiser. 

A system in accordance with the present invention is described below, firstly in general terms 
and then in greater detail. The system operates on an LPC residual signal on a frame by 
frame basis. 

Speech is synthesised using the following general expression: 

$$s(i) = \sum_{k=0}^{K} A_k(i) \cos\big(\omega_k(i) + \phi_k\big) \tag{1}$$

where i is the sampling instant and $A_k(i)$ represents the amplitude value of the kth cosine term 
$\cos(\theta_k(i))$ (with $\theta_k(i) = \omega_k(i) + \phi_k$) as a function of i. In voiced speech K depends on the 
pitch frequency of the signal. 

A voiced/unvoiced classification process allows the coding of voiced and unvoiced frames to 
be handled in different ways. Unvoiced frames are modelled in terms of an RMS value and a 
random time series. In voiced frames a pitch period estimate is obtained and used to define a 
pitch segment which is centred at the middle of the frame. Pitch segments from adjacent 
frames are DFT transformed and only the resulting pitch segment magnitude information is 
coded and transmitted. Furthermore, pitch segment magnitude samples are classified as 
strongly or weakly voiced. Thus in addition to voiced/unvoiced information, the system 
transmits for every voiced frame the pitch period value, the magnitude spectral information of 
the pitch segment, the strong/weak voiced classification of the pitch magnitude spectral 
values, and the LPC coefficients. Thus, the information which is transmitted for every voiced 
frame is, in addition to voiced/unvoiced information, the pitch period value, the magnitude 
spectral information of the pitch segment, and the LPC filter coefficients. 

At the receiver a synthesis process, that includes interpolation, is used to reconstruct the 
waveform between the middle points of the current (n+1)th and previous nth frames. The 
basic synthesis equation for the residual signal is: 

$$Res(i) = \sum_{j=0}^{K} MG_j \cos\big(phase_j(i)\big) \tag{2}$$

where $MG_j$ are decoded pitch segment magnitude values and $phase_j(i)$ is calculated from the 
integral of the linearly interpolated instantaneous harmonic frequencies $\omega_j(i)$. K is the largest 
value of j for which $\omega_j(i) < \pi$. 

In the transitions from unvoiced to voiced, the initial phase for each harmonic is set to zero. 
Phase continuity is preserved across the boundaries of successive interpolation intervals. 

The synthesis process is performed twice however, once using the magnitude spectral values 
$MG_j^{n+1}$ of the pitch segment derived from the current (n+1)th frame and again using the 
magnitude values $MG_j^{n}$ of the pitch segment derived in the previous nth frame. The phase 
function $phase_j(i)$ in each case remains the same. The resulting residual signals $Res_n(i)$ and 
$Res_{n+1}(i)$ are used as inputs to corresponding LPC synthesis filters calculated for the nth and 
(n+1)th speech frames. The two LPC synthesised speech waveforms are then weighted by 
$W_{n+1}(i)$ and $W_n(i)$ to yield the recovered speech signal. 

Thus the overall synthesis process, for successive voiced frames, can be described by: 

$$S(i) = W_n(i) \sum_{j=0}^{K} H^{n}\big(\omega_j^{n}(i)\big)\, MG_j^{n} \cos\big[phase_j^{n}(i) + \varphi^{n}\big(\omega_j^{n}(i)\big)\big] + W_{n+1}(i) \sum_{j=0}^{K} H^{n+1}\big(\omega_j^{n}(i)\big)\, MG_j^{n+1} \cos\big[phase_j^{n}(i) + \varphi^{n+1}\big(\omega_j^{n}(i)\big)\big] \tag{3}$$

where $H^{n}(\omega_j^{n}(i))$ is the frequency response of the nth frame LPC synthesis filter calculated 
at the $\omega_j^{n}(i)$ harmonic frequency function at the ith instant. $\varphi^{n}(\omega_j^{n}(i))$ is the associated 
phase response of this filter. $\omega_j^{n}(i)$ and $phase_j^{n}(i)$ are the frequency and phase functions 
defined for the sampling instants i, with i covering the middle of the nth frame to the middle 
of the (n+1)th frame segments. K is the largest value of j for which $\omega_j^{n}(i) < \pi$. 

The above speech synthesis process introduces two "phase dispersion" terms, i.e. $\varphi^{n}(\omega_j^{n}(i))$ 
and $\varphi^{n+1}(\omega_j^{n}(i))$, which effectively reduce the degree of pitch periodicity in the recovered 
signal. In addition, this "double synthesis" arrangement followed by an overlap-add process 
ensures an effective smooth evolution of the speech spectral envelope (LPC) on a sample by 
sample basis. 

The LPC excitation signal is based on a "mixed" excitation model which allows for the 
appropriate mixing of periodic and random excitation components in voiced frames on a 
frequency-band basis. This is achieved by operating the system such that the magnitude 
spectrum of the residual signal is examined, and applying a peak-picking process, near the $\omega_j$ 
resonant frequencies, to detect possible dominant spectral peaks. A peak associated with a 
frequency $\omega_j$ indicates a high degree of voicing (represented by $hv_j = 1$) for that harmonic. The 
absence of an adjacent spectral peak, on the other hand, indicates a certain degree of 
randomness (represented by $hv_j = 0$). When $hv_j = 1$ (to indicate "strong" voicing) the 
contribution of the jth harmonic to the synthesis process is $MG_j \cos(phase_j(i))$. However, 
when $hv_j = 0$ (to indicate "weak" voicing) the frequency of the jth harmonic is slightly 
dithered, its magnitude $MG_j$ is reduced to $MG_j/\sqrt{2}$ and random cosine terms are added 
symmetrically alongside the jth harmonic $\omega_j$. The terms "strong" and "weak" are used in this 
sense below. The number NRS of these random terms is 



$$NRS = 2 \times \left\lceil \frac{\omega}{4\pi \times (50/f_s)} \right\rceil \tag{4}$$

where $\lceil\ \rceil$ indicates rounding off to the next larger integer value. Furthermore, the NRS 
random components are spaced at 50 Hz intervals symmetrically about $\omega_j$, $\omega_j$ being located in 
the middle of such a 50 Hz interval. The amplitudes of the NRS random components are set 
to $MG_j/\sqrt{2 \times NRS}$. Their initial phases are selected randomly from the $[-\pi, +\pi]$ region at 
pitch period intervals. 

The $hv_j$ information must be transmitted to be available at the receiver and, in order to reduce 
the bit rate allocated to $hv_j$, the bandwidth of the input signal is divided into a number of 
fixed size bands $BD_k$ and a "strongly" or "weakly" voiced flag $Bhv_k$ is assigned for each 
band. In a "strongly" voiced band, a highly periodic signal is reproduced. In a "weakly" 
voiced band, a signal which combines both periodic and aperiodic components is required. 
These bands are classified as strongly voiced ($Bhv_k = 1$) or weakly voiced ($Bhv_k = 0$) using a 
majority decision rule approach on the $hv_j$ classification values of the harmonics $\omega_j$ contained 
within each frequency band. 
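
The majority decision rule can be sketched as follows. This is a minimal illustration: the band edges, the treatment of ties (resolved as strongly voiced) and of bands containing no harmonic (resolved as weakly voiced), and the function name are assumptions that the text does not specify.

```python
def classify_bands(harmonic_freqs_hz, hv, band_edges_hz):
    """Return one flag per band: 1 = strongly voiced, 0 = weakly voiced.

    harmonic_freqs_hz : frequency of each harmonic
    hv                : per-harmonic classification (1 strong, 0 weak)
    band_edges_hz     : ascending edges of the fixed size bands
    """
    flags = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        votes = [v for f, v in zip(harmonic_freqs_hz, hv) if lo <= f < hi]
        strong = sum(votes)
        # Assumed tie-break: a band is strong when at least half of its
        # harmonics are strong; an empty band is treated as weak.
        flags.append(1 if votes and 2 * strong >= len(votes) else 0)
    return flags
```

Only the band flags are then transmitted, and the receiver assigns each harmonic the flag of the band in which it falls.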

Further restrictions can be imposed on the strongly/weakly voiced profiles resulting from the 
classification of bands. For example, the first X bands may always be strongly voiced, i.e. 
$hv_j = 1$ for $BD_k$ with k=1,2,...,X, and X being a variable. The remaining spectral bands can be 
strongly or weakly voiced. 

Figure 1 schematically illustrates processes operated by the system encoder. These processes 
are referred to in Figure 1 as Processes I to VII and these terms are used throughout this 
document. Figure 2 represents the relationship between analysis/coding frame sizes 
employed. These are M samples per coding frame, e.g. 160 samples per frame, and k frames 
are analysed in a block, for example k=4. This block size is used for matrix quantization. A 
speech signal is input and Processes I, III, IV, VI and VII produce outputs for transmission. 

Assuming that the first Matrix Quantization analysis frame (MQA) of k×M samples is 
available, each of the k coding frames within the MQA is classified as voiced or unvoiced 
($V_n$) using Process I. A pitch estimation part of Process I provides a pitch period value $P_n$ 
only when a coding frame is voiced. 

Process II operates in parallel on the input speech samples and estimates p LPC filter 
coefficients $a$ (for example p=10) every L samples (L is a multiple of M, i.e. L=m×M, and m 
may be equal to for example 2). In addition, k/m is an integer and represents the frame 
dimension of the matrix quantizer employed in Process III. Thus the LPC filter coefficients 
are quantized using Process III and transmitted. The quantized coefficients $\hat{a}$ are used to 
derive a residual signal $R^n(i)$. 

When an input coding frame is unvoiced, the energy $E_n$ of the residual obtained for this 
frame is calculated (Process VII). $\sqrt{E_n}$ is then quantized and transmitted. 

When the nth coding frame is classified as voiced, a segment of $P_n$ residual samples is 
obtained ($P_n$ is the pitch period value associated with the nth frame). This segment is centred 
in the middle of the frame. The selected samples are DFT transformed (Process V) to yield 
$\lceil (P_n + 1)/2 \rceil$ spectral magnitude values $MG_j^n$, $0 \le j < \lceil (P_n + 1)/2 \rceil$, and $\lceil (P_n + 1)/2 \rceil$ phase 
values. The phase information is neglected. The magnitude information is coded (using 
Process VI) and transmitted. In addition a segment of 20 msecs, which is centred in the 
middle of the nth coding frame, is obtained from the residual signal $R^n(i)$. This is input to 
Process IV, together with $P_n$, to provide the strongly/weakly voiced classification parameters 
$hv_j^n$ of the harmonics $\omega_j^n$. Process IV produces quantized Bhv information, which for voiced 
frames is multiplexed and transmitted to the receiver together with the voiced/unvoiced 
decision $V_n$, the pitch period $P_n$, the quantized LPC coefficients $\hat{a}$ of the corresponding LPC 
frame, and the magnitude values $MG_j^n$. In unvoiced frames only the $\sqrt{E_n}$ quantized value 
and the quantized LPC filter coefficients $\hat{a}$ are transmitted. 

Figure 3 schematically illustrates processes operated by the system decoder. In general terms, 
given the received parameters of the nth coding frame and those of the previous (n-1)th 
coding frame, the decoder synthesises a speech signal $S_n(i)$ that extends from the middle of 
the (n-1)th frame to the middle of the nth frame. This synthesis process involves the 
generation in parallel of two excitation signals $Res_n(i)$ and $Res_{n-1}(i)$ which are used to drive 
two independent LPC synthesis filters $1/A_n(z)$ and $1/A_{n-1}(z)$ the coefficients of which are 
derived from the transmitted quantized coefficients $\hat{a}$. The outputs $X_n(i)$ and $X_{n-1}(i)$ of these 
synthesis filters are weighted and added to provide a speech segment which is then post 
filtered to yield the recovered speech $S_n(i)$. The excitation synthesis process used in both 
paths of Figure 3 is shown in more detail in Figure 4. 

The process commences by considering the voiced/unvoiced status $V_k$, where k is equal to n 
or n-1 (see Figure 4). When the frame is unvoiced, i.e. $V_k = 0$, a gaussian random number 
generator RG(0,1) of zero mean and unit variance provides a time series which is 
subsequently scaled by the $\sqrt{E_k}$ value received for this frame. This is effectively the 
required 

$$Res_k(i) = \sqrt{E_k} \times RG(0,1) \tag{5}$$

signal which is then presented to the corresponding LPC synthesis filter $1/A_k(z)$, k=n or n-1. 
Performance could be increased if the $\sqrt{E}$ value was calculated, quantized and 
transmitted every 5 msecs. Thus, provided that bits are available when coding unvoiced 
speech, four $\sqrt{E_\xi}$, $\xi$=0,...,3, values are transmitted for every unvoiced frame of 20 msecs 
duration (160 samples). 
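
Equation (5) can be sketched as follows. This is a minimal illustration which assumes that the received energy value is a per-sample (mean square) energy, so that scaling unit-variance noise by its square root reproduces it; the function name and the optional seeded generator are hypothetical.

```python
import math
import random


def unvoiced_excitation(energy, num_samples, rng=None):
    """Gaussian noise RG(0,1) scaled by sqrt(energy), as in Equation (5)."""
    rng = rng or random.Random()
    gain = math.sqrt(energy)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]
```

When four energy values are available per 20 msec frame, the same function could be called once per 5 msec sub-block (40 samples at 160 samples per frame) and the results concatenated.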

In the case where $V_k = 1$, the $Res_k(i)$ excitation signal is defined as the summation of a 
"harmonic" $Res_k^h(i)$ component and a "random" $Res_k^r(i)$ component. The top path of the 
$V_k = 1$ part of the synthesis in Figure 4, which provides the harmonic component of this mixed 
excitation model, always calculates the instantaneous harmonic frequency function $\omega_j^n(i)$ 
which is associated with the interpolation interval that is defined between the middle points 
of the nth and (n-1)th frames (i.e. this action is independent of the value of k). Thus, when 
decoding the nth frame, $\omega_j^n(i)$ is calculated using the pitch frequencies $f_j^{1,n}$, $f_j^{2,n}$ and linear 
interpolation, i.e. 

$$\omega_j^n(i) = 2\pi \left( \frac{f_j^{2,n} - f_j^{1,n}}{M}\, i + f_j^{1,n} \right) \tag{6}$$

with $0 \le j < \lceil (P_{max} + 1)/2 \rceil$, $0 \le i < M$ and $P_{max} = \max[P_n, P_{n-1}]$. 
The frequencies $f_j^{1,n}$ and $f_j^{2,n}$ are defined as follows: 

I) When both the nth and (n-1)th coding frames are voiced, i.e. $V_n = 1$ and $V_{n-1} = 1$, then the 
pitch frequencies are estimated as follows: 
a) If $|P_n - P_{n-1}| < 0.2 \times (P_n + P_{n-1})$, 
which means that the pitch values of the nth and (n-1)th coding frames are rather 
similar, then: 

[Equation (8) is illegible in the source text.] 

otherwise 

[Equation (9) is illegible in the source text.] 

The $f_j^{2,n-1}$ value is calculated during the decoding process of the previous (n-1)th 
coding frame. $hv_j^n$ is the strongly/weakly voiced classification (0 or 1) of the jth 
harmonic $\omega_j^n$. $P_n$ and $P_{n-1}$ are the received pitch estimates from the n and n-1 frames. 
RU(-a,+a) indicates the output of a random number generator with uniform pdf within 
the -a to +a range (a=0.00375). 
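b) If $|P_n - P_{n-1}| \ge 0.2 \times (P_n + P_{n-1})$ (10) 

then the two pitch frequencies are given by a pair of expressions, the first of which includes 
the dither term $(1 - hv_j^n) \times RU(-a,+a)$ and the second of which includes a term $b \times j$, 
where b is defined by Equations (11) and (12). [These two expressions and Equations (11) 
and (12) are illegible in the source text.] 

Notice that in case (b), which applies for significantly different $P_n$ and $P_{n-1}$ pitch estimates, 
Equations 11 and 12 ensure that the rate of change of the $\omega_j^n(i)$ function is restricted to a 
bound that depends on $P_n$, $P_{n-1}$ and M. [The expression for this bound is illegible in the 
source text.] 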
II) When one of the two coding frames (i.e. n, n-1) is unvoiced, one of the following two 
definitions is applicable: 
a) for $V_{n-1} = 0$ and $V_n = 1$: 

$$f_j^{1,n} = \frac{1}{P_n} \times j, \quad 0 \le j < \left\lceil \frac{P_n + 1}{2} \right\rceil$$

and $f_j^{2,n}$ is given by Equation (8). 
b) for $V_{n-1} = 1$ and $V_n = 0$: 

$f_j^{1,n}$ is set to the $f_j^{2,n-1}$ value, which has been calculated during the decoding process of 
the previous (n-1)th coding frame, and $f_j^{2,n} = f_j^{1,n}$. 



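Given $\omega_j^n(i)$, the instantaneous phase function $phase_j^n(i)$ is calculated by: 

$$phase_j^n(i) = 2\pi \frac{f_j^{2,n} - f_j^{1,n}}{2M}\, i^2 + 2\pi f_j^{1,n}\, i + phase_j^{n-1}(M) \tag{13}$$

for $0 \le j < \lceil (P_{max} + 1)/2 \rceil$ and $0 \le i < M$. 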
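
The relationship between the linearly interpolated frequency of Equation (6) and the phase of Equation (13) can be sketched as follows. This is a minimal illustration for a single harmonic; the function name is hypothetical, frequencies are assumed to be expressed in cycles per sample, and the returned list has one extra element so that its last value can be carried into the next interval as the starting phase.

```python
import math


def harmonic_phase(f_start, f_end, frame_len, prev_phase=0.0):
    """Phase track of one harmonic over one interpolation interval.

    The instantaneous frequency moves linearly from f_start to f_end
    over frame_len samples; the phase is its integral plus the phase
    reached at the end of the previous interval.
    """
    slope = (f_end - f_start) / frame_len
    return [
        math.pi * slope * i * i + 2.0 * math.pi * f_start * i + prev_phase
        for i in range(frame_len + 1)
    ]
```

Calling the function again with `prev_phase` set to the last element of the previous track keeps the phase continuous across interval boundaries; after an unvoiced to voiced transition `prev_phase` would be zero.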



Furthermore, the "harmonic" component $Res_k^h(i)$ of the residual signal is given by: 

$$Res_k^h(i) = \sum_{j=0}^{\lceil (P_{max}+1)/2 \rceil - 1} C_j(i) \times MG_j^k(hv_j^k) \times \cos\big[phase_j^n(i)\big], \quad 0 \le i < M \tag{14}$$

where k=n or n-1, 

$$C_j(i) = \begin{cases} 1 & \text{if } \omega_j^n(i) < \pi \\ 0 & \text{otherwise} \end{cases}$$

$$MG_j^k(hv_j^k) = \begin{cases} MG_j^k/\sqrt{2} & \text{for } hv_j^k = 0 \\ MG_j^k & \text{for } hv_j^k = 1 \end{cases} \quad \text{for } 1 \le j < \left\lceil \frac{P_k + 1}{2} \right\rceil$$

$$MG_j^k(hv_j^k) = 0 \quad \text{otherwise, including } j = 0$$

and $MG_j^k$, $j = 0, \ldots, \lfloor (P_k + 1)/2 \rfloor - 1$, are the received magnitude values of the "kth" coding 
frame, with k=n or k=n-1. 

The second path of the $V_k = 1$ case in Figure 4 provides the random excitation component 
$Res_k^r(i)$. In particular, given the recovered strongly/weakly voiced classification values $hv_j^k$, 
the system calculates for those harmonics with $hv_j^k = 0$ the number NRS of random sinusoidal 
components which are used to randomise the corresponding harmonic. This is: 



$$NRS = 2 \times \left\lceil \frac{\omega}{4\pi \times (50/f_s)} \right\rceil \tag{15}$$

where $f_s$ is the sampling frequency. Notice that the NRS random sinusoidal components are 
located symmetrically about the corresponding harmonic $\omega_j^k$ and they are spaced 50 Hz 
apart. 

The instantaneous frequency of the qth random component, q=0,1,...,NRS-1, for the jth 
harmonic $\omega_j^k$ is calculated by: 

$$\omega_{j,q}^k(i) = \omega_j^k(i) + 2\pi \times (25/f_s) + \big(q - (NRS/2)\big) \times 2\pi \times (50/f_s) \tag{16}$$

for $0 \le j < \lceil (P_{max} + 1)/2 \rceil$ and $0 \le i < M$. 
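
The placement of the random components in Equation (16) can be sketched as follows. This is a minimal illustration: the function name is hypothetical, frequencies are in radians per sample, and NRS is assumed to have been computed already and to be even.

```python
import math


def random_component_frequencies(omega_j, nrs, fs):
    """Frequencies (rad/sample) of the nrs random components placed at
    50 Hz spacing symmetrically about the harmonic omega_j."""
    step = 2.0 * math.pi * 50.0 / fs        # 50 Hz spacing
    half_step = 2.0 * math.pi * 25.0 / fs   # centres the harmonic in a 50 Hz interval
    return [omega_j + half_step + (q - nrs / 2) * step for q in range(nrs)]
```

For NRS = 4 the components fall at -75, -25, +25 and +75 Hz relative to the harmonic, so the harmonic itself sits in the middle of a 50 Hz interval as stated in the text.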



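The associated phase value $Ph_{j,q}^k(i)$ is given by Equation (17), defined for 
$0 \le j < \lceil (P_{max} + 1)/2 \rceil$ and $0 \le i < M$. [Equation (17) is illegible in the source text 
apart from a term in 2M.] 

where $\varphi = RU(-\pi, +\pi)$. In addition, the $Ph_{j,q}^k(i)$ function is randomised at pitch intervals 
(i.e. when the phase of the fundamental harmonic component is a multiple of 2π, i.e. 
$\mathrm{mod}(phase_1^n(i), 2\pi) = 0$). 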



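Given the $Ph_{j,q}^k(i)$, the random excitation component $Res_k^r(i)$ is calculated as follows: 

$$Res_k^r(i) = \sum_{j} \sum_{q=0}^{NRS-1} C_{j,q}^k(i) \times MG_{j,q}^k(hv_j^k) \times \cos\big(Ph_{j,q}^k(i)\big), \quad 0 \le i < M \tag{18}$$

where 

$$MG_{j,q}^k(hv_j^k) = \begin{cases} MG_j^k/\sqrt{2 \times NRS} & \text{for } hv_j^k = 0 \\ 0 & \text{for } hv_j^k = 1 \end{cases} \quad \text{for } 1 \le j < \left\lceil \frac{P_k + 1}{2} \right\rceil$$

$$MG_{j,q}^k(hv_j^k) = 0 \quad \text{otherwise, including } j = 0$$

and 

$$C_{j,q}^k(i) = \begin{cases} 1 & \text{if } \omega_{j,q}^k(i) < \pi \\ 0 & \text{otherwise} \end{cases}$$

Thus for $V_k = 1$ voiced coding frames, the mixed excitation residual is formed as: 

$$Res_k(i) = Res_k^h(i) + Res_k^r(i) \tag{19}$$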

Notice that when $V_k = 0$, instead of using Equation 5, the random excitation signal $Res_k(i)$ can 
be generated by the summation of random cosines located 50 Hz apart, where their phase is 
randomised every X samples, and X<M, i.e. 

[Equation (20), a summation of random cosine terms indexed by ξ = 0,1,2,... for $0 \le i < M$, 
is illegible in the source text.] 

C is defined so as to ensure that the phase of the cos terms is randomised every X samples 
across frame boundaries. The resulting $Res_n(i)$ and $Res_{n-1}(i)$ excitation sequences, see Figure 
4, are processed by the corresponding $1/A_n(z)$ and $1/A_{n-1}(z)$ LPC synthesis filters. When 
coding the next (n+1)th frame, $1/A_{n-1}(z)$ becomes $1/A_n(z)$ (including the memory) and 
$1/A_n(z)$ becomes $1/A_{n+1}(z)$ with the memory of $1/A_n(z)$. This is valid in all cases except 
during an unvoiced to voiced transition, where the memory of the $1/A_{n+1}(z)$ filter is set to 
zero. The coefficients of the $1/A_n(z)$ and $1/A_{n-1}(z)$ synthesis filters are calculated directly 
from the nth and (n-1)th coding speech frames respectively, when the LPC analysis frame 
size L is equal to M samples. However, when L≠M (usually L>M) linear interpolation is used 
on the filter coefficients (defined every L samples) so that the transfer function of the 
synthesis filter is updated every M samples. 

The output signals of these filters, denoted as $X_{n-1}(i)$ and $X_n(i)$, are weighted, overlapped and 
added as schematically illustrated in Figure 5 to yield $\hat{X}_n(i)$, i.e.: 



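$$\hat{X}_n(i) = W_{n-1}(i) X_{n-1}(i) + W_n(i) X_n(i)$$

where, when $V_n = V_{n-1}$, 

$$W_n(i) = 0.54 - 0.46\cos\left(\frac{2\pi i}{2M-1}\right) \quad \text{for } 0 \le i < M$$

and, when $V_n \ne V_{n-1}$, 

$$W_n(i) = \begin{cases} 0 & \text{for } 0 \le i < 0.25M \\ 0.5 - 0.5\cos\left(\pi \dfrac{i - 0.25M}{0.5M - 1}\right) & \text{for } 0.25M \le i < 0.75M \\ 1 & \text{for } 0.75M \le i < M \end{cases} \tag{21}$$

and, when $V_n = V_{n-1}$, 

$$W_{n-1}(i) = 0.54 - 0.46\cos\left(\frac{2\pi (i + M)}{2M-1}\right) \quad \text{for } 0 \le i < M$$

and, when $V_n \ne V_{n-1}$, 

$$W_{n-1}(i) = \begin{cases} 1 & \text{for } 0 \le i < 0.25M \\ 0.5 + 0.5\cos\left(\pi \dfrac{i - 0.25M}{0.5M - 1}\right) & \text{for } 0.25M \le i < 0.75M \\ 0 & \text{for } 0.75M \le i < M \end{cases} \tag{22}$$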
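
The weighting and overlap-add step can be sketched as follows. This is a minimal illustration under stated assumptions: the Hamming-half arguments i and i+M over 2M-1 and the raised-cosine transition over the middle half of the frame are a reading of the partly illegible Equations (21) and (22), and the function names are hypothetical.

```python
import math


def overlap_add_weights(frame_len, same_voicing=True):
    """Return (w_prev, w_cur): fading-out and fading-in weights of length frame_len."""
    m = frame_len
    w_prev, w_cur = [], []
    for i in range(m):
        if same_voicing:
            # Rising and falling halves of a 2M-point Hamming window.
            cur = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (2 * m - 1))
            prev = 0.54 - 0.46 * math.cos(2.0 * math.pi * (i + m) / (2 * m - 1))
        elif i < 0.25 * m:
            cur, prev = 0.0, 1.0
        elif i < 0.75 * m:
            c = math.cos(math.pi * (i - 0.25 * m) / (0.5 * m - 1))
            cur, prev = 0.5 - 0.5 * c, 0.5 + 0.5 * c
        else:
            cur, prev = 1.0, 0.0
        w_prev.append(prev)
        w_cur.append(cur)
    return w_prev, w_cur


def overlap_add(x_prev, x_cur, same_voicing=True):
    """Weighted sum of the two LPC synthesis filter outputs for one frame."""
    w_prev, w_cur = overlap_add_weights(len(x_cur), same_voicing)
    return [wp * a + wc * b for wp, a, wc, b in zip(w_prev, x_prev, w_cur, x_cur)]
```

In the voicing-transition case the two weights sum to one at every sample, so a signal common to both paths passes through unchanged.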



$\hat{X}_n(i)$ is then filtered via a PF(z) post filter and a high pass filter HP(z) to yield the speech 
segment $S'_n(i)$. PF(z) is the conventional post filter: 

[Equation (23) is illegible in the source text.] 

with b=0.5, c=0.8 and $\mu = 0.5 K_1^n$. $K_1^n$ is the first reflection coefficient of the nth coding 
frame. HP(z) is defined as: 

[Equation (24) is illegible in the source text.] 

with $b_1 = c_1 = 0.9807$ and $a_1 = 0.961481$. 

In order to ensure that the energy of the recovered S(i) signal is preserved, as compared to 
that of the $\hat{X}(i)$ sequence, a scaling factor SC is calculated every LPC frame of L samples: 

$$SC_l = \sqrt{\frac{E_l}{E'_l}} \tag{25}$$

where $E_l = \sum_{i=0}^{L-1} \hat{X}_l(i)^2$ and $E'_l = \sum_{i=0}^{L-1} S'_l(i)^2$. 

$SC_l$ is associated with the middle of the lth LPC frame as illustrated in Figure 6. The filtered 
samples from the middle of the (l-1)th frame to the middle of the lth frame are then multiplied 
by $SC_l(i)$ to yield the final output of the system, $S_l(i) = SC_l(i) \times S'_l(i)$, where: 

$$SC_l(i) = SC_l W_l(i) + SC_{l-1} W_{l-1}(i), \quad 0 \le i < L \tag{26}$$

and $W_l(i)$ and $W_{l-1}(i)$ are complementary raised-cosine weights of the form 
$0.5 - 0.5\cos(\cdot)$ and $0.5 + 0.5\cos(\cdot)$ respectively, for $0 \le i < L$. [The argument of the 
cosine, which is proportional to π, is not fully legible in the source text.] 

The scaling process introduces an extra half LPC frame delay into the coding-decoding 
process. 

The above described energy scaling procedure operates on an LPC frame basis in contrast to 
both the decoding and PF(z), HP(z) filtering procedures which operate on the basis of a frame 
of M samples. 
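
The energy scaling procedure can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the cosine argument π·i/L used for the interpolation weights is an assumption, and a unit factor is returned for a silent frame to avoid division by zero.

```python
import math


def scaling_factor(reference, filtered):
    """Gain that gives `filtered` the same energy as `reference` (cf. Equation 25)."""
    e_ref = sum(x * x for x in reference)
    e_filt = sum(s * s for s in filtered)
    return math.sqrt(e_ref / e_filt) if e_filt > 0.0 else 1.0


def interpolate_scaling(sc_prev, sc_cur, num_samples):
    """Per-sample gain moving smoothly from sc_prev to sc_cur (cf. Equation 26).
    The raised-cosine argument pi*i/num_samples is an assumed form."""
    gains = []
    for i in range(num_samples):
        w_cur = 0.5 - 0.5 * math.cos(math.pi * i / num_samples)
        gains.append(sc_cur * w_cur + sc_prev * (1.0 - w_cur))
    return gains
```

Because the factor for a frame is only known once the whole frame has been filtered, and it is applied between frame midpoints, an additional delay of half an LPC frame results, as the text notes.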



Details of the coding processes represented in Figure 1 will now be described. 

Process I derives a voiced/unvoiced (V/UV) classification $V_n$ for the nth input coding frame 
and also assigns a pitch estimate $P_n$ to the middle sample $M_n$ of this frame. This process is 
illustrated in Figure 7. 

The V/UV and pitch estimation analysis frame is centred at the middle $M_{n+1}$ of the (n+1)th 
coding frame with 237 samples on either side. The signal x(i) in the above analysis frame is 
low pass filtered with a cut off frequency $f_c$ = 1.45kHz and the resulting (-147, 147) samples 
centred about $M_{n+1}$ are used in a pitch estimation algorithm, which yields an estimate $P_{M_{n+1}}$. 
The pitch estimation algorithm is illustrated in Figure 8, where P represents the output of the 
pitch estimation process. The 294 input samples are used to calculate a crosscorrelation 
function CR(d), where d is shown in Figure 9 and 20≤d≤147. Figure 9 shows the two speech 
segments which participate in the calculation of the crosscorrelation function value at "d" 
delay. In particular, for a given value of d, the crosscorrelation function $\rho_d(j)$ is calculated 
for the segments $\{x_L\}_d$, $\{x_R\}_d$ as: 

$$\rho_d(j) = \frac{\sum_{i=0}^{d-j-1} \big(x_L^d(i) - \bar{x}_L^d\big)\big(x_R^d(i) - \bar{x}_R^d\big)}{\sqrt{\sum_{i=0}^{d-j-1} \big(x_L^d(i) - \bar{x}_L^d\big)^2 \sum_{i=0}^{d-j-1} \big(x_R^d(i) - \bar{x}_R^d\big)^2}} \tag{27}$$

where: 

$x_L^d(i) = x(M_{n+1} - d + j + i)$, $x_R^d(i) = x(M_{n+1} + i)$, for $0 \le i \le d-j-1$, j = 0,1,...,f(d). Figure 10 
schematically represents the $x_L^d$ and $x_R^d$ speech segments used in the calculation of the 
value CR(d), and the non linear relationship between d and f(d) is given in Figure 11. $\bar{x}_L^d$ and 
$\bar{x}_R^d$ represent the mean value of the $\{x_L^d\}$ and $\{x_R^d\}$ sequences respectively. 

The algorithm then selects $\max_j[\rho_d(j)]$ and defines CR(d) = $\max_j[\rho_d(j)]$, 20≤d≤147. 

In addition to CR(d), the box in Figure 8 labelled "Calculation of CR function and selection 
of its peaks", whose detailed diagram is shown in Figure 12, provides also the locations loc(k) 
of the peaks of the CR(d) function, where k=1,2,...,Np and Np is the number of peaks in a 
CR(d) function. 
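
The crosscorrelation of the two segments on either side of the frame centre can be sketched as follows. This is a minimal illustration: the normalised (mean-removed) form is assumed, the function names are hypothetical, and the maximum offset f(d) is passed in as a parameter because its values are given only in Figure 11.

```python
import math


def normalised_crosscorrelation(x, centre, d, j=0):
    """Correlation of the d-j samples before `centre` with the d-j samples from `centre`."""
    length = d - j
    left = x[centre - length:centre]      # x(M - d + j + i), i = 0 .. d-j-1
    right = x[centre:centre + length]     # x(M + i),         i = 0 .. d-j-1
    mean_l = sum(left) / length
    mean_r = sum(right) / length
    num = sum((a - mean_l) * (b - mean_r) for a, b in zip(left, right))
    den = math.sqrt(
        sum((a - mean_l) ** 2 for a in left) * sum((b - mean_r) ** 2 for b in right)
    )
    return num / den if den > 0.0 else 0.0


def cr(x, centre, d, max_offset):
    """CR(d): the largest correlation over the offsets j = 0 .. max_offset."""
    return max(normalised_crosscorrelation(x, centre, d, j) for j in range(max_offset + 1))
```

For a signal that is periodic with period 40 samples, the value at d = 40 is close to one and the value at d = 20 is close to minus one, which is the behaviour the peak selection relies on.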

Figure 12 is a block diagram of the process involving the calculation of the CR function and 
the selection of its peaks. As illustrated, given CR(d), a threshold th(d) is determined as: 

$$th(d) = CR(d_{max}) - b - (d - d_{max}) \times a - c \tag{28}$$

where c=0.08 when $[(V'_n = 1)$ AND $(|d_{max} - P'_n| < 0.15 \times (d_{max} + P'_n))$ OR $(V_{n-1} = 1)]$ 
AND $(d > 0.875 \times P'_n)$ AND $(d < 1.125 \times P'_n)$, 
or c=0 elsewhere, 
and constants a and b are defined as: 

  V'n    Vn-1    b        a
  1      1       0.025    0.0005
  1      0       0.04     0.0005
  0      1/0     0.05     0.0006

$d_{max}$ is equal to the value of d for which CR(d) is maximised to $CR(d_{max})$. Using this threshold 
the CR(d) function is clipped to $CR_L(d)$, i.e. 
$CR_L(d) = 0$ for CR(d) < th(d), 
$CR_L(d) = CR(d)$ otherwise. 

CR^Cd) contains segments s= 1,2,3 , of positive values separated by Go runs of zero 

values. The algorithm examines the length of the Go runs which exist between successive G^ 
segments (i.e. Gj and Gg+i), and when G(, < 17, then the G^ segment with the max CR^Xd) 
value is kept. This procedure yields CR, {d) , which is then examined by the following "peak 
picking" procedure. In particular those CR, {d) values are selected for which: 
CR^{d)>CR, {d'\) and CR,Xd) > CR^ id \) 
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However certain peaks can be rejected if: 
CR,(locik))< CR^Qocik + l))x 0.9 

This ensures that the final CR, Qoc{k)) k==l,...,Np does not contain spurious low level 
CRf id) peaks. The locations d of the above defined CRf (d) peaks are given by loc(k) 
k=l,2,...,Np. 
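
The clipping and peak picking steps can be sketched as follows. This is a simplified illustration: the merging of positive segments separated by short zero runs (fewer than 17 zeros) is not reproduced, the threshold is supplied by the caller, and the function name `clip_and_pick_peaks` is hypothetical.

```python
def clip_and_pick_peaks(cr, th, reject_ratio=0.9):
    """cr and th map each delay d to CR(d) and th(d).
    Returns the delays of the retained peaks in increasing order."""
    delays = sorted(cr)
    # Clip: values below the threshold are set to zero.
    clipped = {d: (cr[d] if cr[d] >= th[d] else 0.0) for d in delays}
    # Peak picking: strictly larger than both neighbours.
    peaks = [d for prev, d, nxt in zip(delays, delays[1:], delays[2:])
             if clipped[d] > 0.0
             and clipped[d] > clipped[prev]
             and clipped[d] > clipped[nxt]]
    # Reject a peak that is lower than 0.9 times the peak that follows it.
    return [d for k, d in enumerate(peaks)
            if k + 1 == len(peaks)
            or clipped[d] >= reject_ratio * clipped[peaks[k + 1]]]

delays = list(range(21, 31))
values = [0.1, 0.2, 0.5, 0.2, 0.1, 0.3, 0.9, 0.3, 0.1, 0.1]
flat_th = {d: 0.15 for d in delays}
picked = clip_and_pick_peaks(dict(zip(delays, values)), flat_th)
```

In this toy input the peak at delay 23 (value 0.5) is rejected because it is below 0.9 times the following peak at delay 27 (value 0.9).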

CR(d) and loc(k) are used as inputs to the following Modified High Resolution Pitch
Estimation algorithm (MHRPE) shown in Figure 8, whose output is $P_{M_{n+1}}$. The flowchart of
this MHRPE procedure is shown in Figure 13, where P is initialised with 0 and, at the end,
the estimated P is the requested $P_{M_{n+1}}$. In Figure 13 the main pitch estimation procedure is
based on a Least Squares Error (LSE) algorithm which is defined as follows:
For each possible pitch value j in the range from 21 to 147 with an increment of 0.1 × j, i.e.
j ∈ {21,23,25,27,30,33,36,40,44,48,53,58,64,70,77,84,92,101,111,122,134}. (Thus 21 iterations
are performed.)

1) Form the multiplication factor vector:

   $u_j(k) = \mathrm{round}\left(loc(k)/j\right)$, k=1,...,Np

2) Reject possible pitch j and go back to (1) if

   a) the same element occurs in $u_j$ twice.

   b) the elements of $u_j$ have as a common factor a prime number.

3) Form the following error quantity

   $E_j = loc^T loc - 2 p_j\, u_j^T loc + p_j^2\, u_j^T u_j$

   where

   $p_j = \dfrac{loc^T u_j}{u_j^T u_j}$

4) Select the $p_{j_s}$ value for which the associated error quantity $E_{j_s}$ is minimum,
(i.e. $j_s: E_{j_s} \le E_j \ \forall j \in \{21,23,\ldots,134\}$). Set $P = p_{j_s}$.

The next two general conditions "Reject Highest Delay" loc(Np) and "Reject Lowest Delay"
loc(1) are included in order to reject false pitch, "double" or "half" values and in general to
provide constraints in the pitch estimates of the system. The "Reject Highest Delay" condition
involves 3 constraints:

i) if P=0 then reject loc(Np).

ii) if loc(Np)>100 then find the local maximum $CR(d_{lm})$ in CR(d) in the vicinity of the
estimated pitch P (i.e. 0.8×P to 1.2×P) and compare this with $th(d_{lm})$, which is determined
as in Equation 28. Reject loc(Np) when $CR(d_{lm}) < th(d_{lm}) - 0.02$.

iii) if the error $E_{j_s}$ of the LSE algorithm is larger than 50 and $u_{j_s}(Np) = Np$ with Np>2 then
reject loc(Np).
The flowchart of this is given in Figure 14.
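
The Least Squares Error search over candidate pitch values can be illustrated with the short Python sketch below. It is not the flowchart of Figure 13: it assumes that the multiplication factors are obtained by rounding loc(k)/j to the nearest integer, it skips candidates that would give a zero factor (an assumption), and the function name `lse_pitch` is hypothetical.

```python
from functools import reduce
from math import gcd

CANDIDATES = [21, 23, 25, 27, 30, 33, 36, 40, 44, 48, 53, 58, 64, 70,
              77, 84, 92, 101, 111, 122, 134]

def lse_pitch(loc):
    """Return (error, pitch, factors) for the best candidate, or None."""
    best = None
    for j in CANDIDATES:
        u = [round(l / j) for l in loc]      # assumed rounding rule
        if 0 in u:                           # assumption: zero factor not allowed
            continue
        if len(set(u)) != len(u):            # same element occurs twice
            continue
        if reduce(gcd, u) > 1:               # elements share a common factor
            continue
        # Least squares fit of loc(k) ~ p * u(k).
        p = sum(l * m for l, m in zip(loc, u)) / sum(m * m for m in u)
        err = sum((l - p * m) ** 2 for l, m in zip(loc, u))
        if best is None or err < best[0]:
            best = (err, p, u)
    return best

err, pitch, factors = lse_pitch([40, 80, 121])
```

For peak locations 40, 80 and 121 the accepted factor vector is [1, 2, 3] and the fitted pitch is 563/14, slightly above 40 samples.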

The "Reject Lowest Delay" general condition, whose flowchart is given in Figure 15, rejects
loc(1) when the following three constraints are simultaneously satisfied:

i) The density of detection of the peaks of the correlation coefficient function is less than or
equal to 0.75, i.e.

   $\dfrac{Np}{u_{j_s}(Np)} \le 0.75$

ii) If the location of the first peak is neglected (i.e. loc(1)), then the remaining locations
exhibit a common factor.

iii) The value of the correlation coefficient function at the locations of the missing peaks is
relatively small compared to adjacent detected peaks, i.e.
If $u_{j_s}(k+1) - u_{j_s}(k) > 1$, for k=1,...,Np, then
for $i = u_{j_s}(k)+1 : u_{j_s}(k+1)-1$

   a) find local maximum $CR(d_{lm})$ in the range from (i-0.1)×loc(1) to (i+0.1)×loc(1).

   b) if $CR(d_{lm}) < 0.97 \times CR(u_{j_s}(k))$ then Reject Lowest Delay. END.

   else Continue

This concludes the pitch estimation procedure of Figure 7 whose output is $P_{M_{n+1}}$. As is also
illustrated in Figure 7 however, in parallel to the pitch estimation, Process I obtains 160
samples centred at the middle of the $M_{n+1}$ coding frame, removes their mean value, and then
calculates R0, R1 and the average $R_{av}$ of the energies of the previous K non-silence coding
frames. K is fixed to 50 for the first 50 non-silence coding frames, increases from 50 to 100
with the next 50 non-silence coding frames, and then remains constant at the value of 100.
The flowchart of the procedure that calculates $R_{av}$, R1, R0 and updates the $R_{av}$ buffer is
shown in Figure 16, where "Count" represents the number of non-silence speech frames, and
denotes increase by one. Notice that TH is an adaptive threshold that is representative
of a silence (non speech) frame and is defined as in Figure 17. CR in this case is equal to
$CR_{max}^{M_{n+1}}$.

Given R0, R1, $R_{av}$ and $CR_{max}^{M_{n+1}}$, the V/UV part of Process I calculates the status $V_{M_{n+1}}$ of the
n+1 frame. The flowchart of this part of the algorithm is shown in Figure 18 where "V"
represents the output V/UV flag of this procedure. Setting the "V" flag to 1 or 0 indicates
voiced or unvoiced classification respectively. The "CR" parameter denotes the maximum
value of the CR function which is calculated in the pitch estimation process. A diagrammatic
representation of the voiced/unvoiced procedure is given in Figure 19.

Having the $V_{M_{n+1}}$ value, the $P_{M_{n+1}}$ estimate and the $V'_n$ and $P'_n$ estimates which have been
produced from Process I operating on the previous nth coding frame, as illustrated in Figure
7, part b, two further locations $M_{n+1}+d1$ and $M_{n+1}+d2$ are estimated and the corresponding
[-147,147] segments of filtered speech samples are obtained as illustrated in Figure 7, part b.
These additional two analysis frames are used as input to the "Pitch Estimation process" of
Figure 8 to yield $P_{M_{n+1}+d1}$ and $P_{M_{n+1}+d2}$. The procedure for calculating d1 and d2 is given in the
flowchart of Figure 20.

The final step in part (a) of Process I of Figure 7 involves the previous V/UV classification
procedure of Figure 8 with inputs R0, R1, $R_{av}$, and:
$CR = \max\left[CR_{max}^{M_{n+1}},\ CR_{max}^{M_{n+1}+d1},\ CR_{max}^{M_{n+1}+d2}\right]$
to yield a preliminary value $V^{pr}_{n+1}$.

In addition, a multipoint pitch estimation algorithm accepts $P_{M_{n+1}}$, $P_{M_{n+1}+d1}$, $P_{M_{n+1}+d2}$, $V_{n-1}$,
$P_{n-1}$, $V'_n$, $P'_n$ to provide a preliminary pitch value $P^{pr}_{n+1}$. The flowchart of this multipoint pitch
estimation algorithm is given in Figure 21, where P1, P2 and P0 represent the pitch estimates
associated with the $M_{n+1}+d1$, $M_{n+1}+d2$ and $M_{n+1}$ points respectively, and P denotes the output
pitch estimate of the process, that is $P^{pr}_{n+1}$.

Finally part (b) of Process I of Figure 7 imposes constraints on the $V^{pr}_{n+1}$ and $P^{pr}_{n+1}$ estimates in
order to ensure a smooth evolution for the pitch parameter. The flowchart of this section is
given in Figure 22. At the start of this process "V" and "P" represent the voicing flag and
pitch estimate values before constraints are applied ($V^{pr}_{n+1}$ and $P^{pr}_{n+1}$ in Figure 7) whereas at
the end of the process "V" and "P" represent the voicing flag and pitch estimate values after
the constraints have been applied ($V'_{n+1}$ and $P'_{n+1}$). The $V'_{n+1}$ and $P'_{n+1}$ produced from this
section are then used in the next pitch post processing section together with $V_{n-1}$, $V'_n$, $P_{n-1}$ and
$P'_n$ to yield the final voiced/unvoiced and pitch estimate parameters $V_n$ and $P_n$ for the nth
coding frame. This pitch post processing stage is defined in the flowchart of Figures 23, 24
and 25, the output A of Figure 23 being the input to Figure 24, and the output B of Figure 24
being the input to Figure 25. At the start of this procedure "$P_n$" and "$V_n$" represent the pitch
estimate and voicing flag respectively, which correspond to the nth coding frame prior to post
processing (i.e. $P'_n$, $V'_n$) whereas at the end of the procedure "$P_n$" and "$V_n$" represent the
final pitch estimate and voicing flag associated with the nth frame (i.e. $P_n$, $V_n$).
The LPC analysis process (Process II of Figure 1) can be performed using the
Autocorrelation, Stabilised Covariance or Lattice methods. The Burg algorithm was used,
although simple autocorrelation schemes could be employed without a noticeable effect in the
decoded speech quality. The LPC coefficients are then transformed to an LSP representation.
Typical values for the number of coefficients are 10 to 12 and a 10th order filter has been
used. LPC analysis processes are well known and described in the literature, for example
"Digital Processing of Speech Signals", L.R. Rabiner and R.W. Schafer, Prentice-Hall Inc.,
Englewood Cliffs, New Jersey, 1978. Similarly, LSP representations are well known, for
example from "Line Spectrum Pair and Speech Data Compression", F Soong and B.H. Juang,
Proc. ICASSP-84, pp 1.10.1-1.10.4, 1984. Accordingly these processes and representations
will not be described further in this document.

In process II, ten LSP coefficients are used to represent the data. These 10 coefficients could
be scalar quantised using 37 bits with the following bit allocation pattern [3,4,4,4,4,4,4,4,3,3].
This is a relatively simple process, but the resulting bit rate of 1850 bits/second is
unnecessarily high. Alternatively the LSP coefficients can be Vector Quantised (VQ) using a
Split-VQ technique. In the Split-VQ technique an LSP parameter vector of dimension "p" is
split into two or more subvectors of lower dimensions and then each subvector is Vector
Quantised separately (when Vector Quantising the subvectors a direct VQ approach is used).
In effect, the LSP transformed coefficient vector, C, which consists of "p" consecutive
coefficients ($c_1, c_2, \ldots, c_p$) is split into "K" vectors, $C_k$ ($1 \le k \le K$), with the corresponding
dimensions $d_k$ ($1 \le d_k \le p$), $p = d_1 + d_2 + \ldots + d_K$. In particular, when "K" is set to "p" (i.e. when C
is partitioned into "p" elements) the Split-VQ becomes equivalent to Scalar Quantisation. On
the other hand, when K is set to unity (K=1, $d_k$=p) the Split-VQ becomes equivalent to Full
Search VQ.
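
The arithmetic behind the scalar quantisation bit rate, and the partitioning step of Split-VQ, can be checked with a few lines of Python. The 20 msec frame duration is the value quoted elsewhere in this document, the (3, 3, 4) split is purely illustrative, and the helper names are hypothetical.

```python
def bit_rate(bits_per_frame, frame_ms=20.0):
    """Bits per second for a given number of bits sent every frame."""
    return bits_per_frame * 1000.0 / frame_ms

def split_vector(c, dims):
    """Partition coefficient vector c into consecutive subvectors
    whose dimensions d_k are listed in dims (sum(dims) must equal p)."""
    if sum(dims) != len(c):
        raise ValueError("subvector dimensions must add up to p")
    parts, start = [], 0
    for d in dims:
        parts.append(c[start:start + d])
        start += d
    return parts

scalar_bits = [3, 4, 4, 4, 4, 4, 4, 4, 3, 3]
total_bits = sum(scalar_bits)                      # bits per frame
rate = bit_rate(total_bits)                        # bits per second
subvectors = split_vector(list(range(10)), [3, 3, 4])
```

With dims set to ten ones the partition reduces to scalar quantisation, and with a single dimension of ten it reduces to full search VQ, as stated above.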
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The above Split VQ approach leads to an LPC filter bit rate of the order of 1.3 to 
1.4 Kbits/sec. In order to minimize further the bit rate of the voice coded system described in
this document a Split Matrix VQ (SMQ) has been developed in the University of Manchester
and reported in "Efficient Coding of LSP Parameters using Split Matrix Quantisation",
C.Xydeas and C.Papanastasiou, Proc ICASSP-95, pp 740-743, 1995. This method results in
transparent LPC quantisation at 900 bits/sec and offers a flexible way to obtain, for a given
quantisation accuracy, the required memory/complexity characteristics for Process III. An
important feature of SMQ is a new weighted Euclidean distance which is defined in detail as
follows.



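$$D\left(L_k(l), L'_k(l)\right) = \sum_{t=0}^{N-1}\sum_{s=0}^{m(k)-1}\left(LSP_{S(k)+s}^{\,l+t} - LSP'^{\,l+t}_{S(k)+s}\right)^2 w_s^2(s,t)\, w_t^2(t) \qquad (29)$$

where $L'_k(l)$ represents the kth (k=1,...,K) quantized submatrix and $LSP'^{\,l+t}_{S(k)+s}$ are its
elements, m(k) represents the spectral dimension of the kth submatrix and N is the SMQ
frame dimension. Note also that: $S(k) = \sum_{j=0}^{k-1} m(j)$, m(0) = 1 and $\sum_{k=1}^{K} m(k) = p$.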

w^O) ^ [(I - £r(0) . — — .£,,(0"' /or transmission frames 0 < ( < N - \ (30) 

1_ Aver (En) j 

when the N LPC frames consist of both voiced and unvoiced frames 
W((t) = Eniif^ otherwise 

where Er(t) is the normalised energy of the prediction error of the (l+t)th frame, En(t) is the
RMS value of the (l+t)th speech frame and Aver(En) is the average RMS value of the N LPC
frames used in SMQ. The values of the constants α and α1 are set to 0.2 and 0.15
respectively.
Also:

$$w_s(s,t) = \left|P\left(LSP_{S(k)+s}^{\,l+t}\right)\right|^{\beta} \qquad (31)$$

where $P(LSP_{S(k)+s}^{\,l+t})$ is the value of the power envelope spectrum of the (l+t) speech frame at the
$LSP_{S(k)+s}^{\,l+t}$ frequency. β is equal to 0.15.

The overall SMQ quantisation process that yields the quantised LSP coefficients vectors $\hat{l}^{\,l}$
to $\hat{l}^{\,l+N-1}$ for the l to l+N-1 analysis frames is shown in Figure 26. This figure also includes
the inverse process, which accepts the above $\hat{l}^{\,l+i}$ vectors, i=0,...,N-1, and provides the
corresponding LPC coefficients vectors $\hat{a}^{\,l}$ to $\hat{a}^{\,l+N-1}$. The $a^{\,l+i}$, i=0,...,N-1, coefficients
vectors are modified, prior to the LPC to LSP transformation, by a 10 Hz bandwidth
expansion as indicated in Figure 26. A 5 Hz bandwidth expansion is also included in the
inverse quantisation process.

Process IV of Figure 1 will now be described. This process is concerned with the mixed
voiced classification of harmonics. When the nth coding frame is classified as voiced, the
residual signal $R^n(i)$ of length 160 samples centred at the middle $M_n$ of the nth coding frame
and the pitch period $P_n$ for that frame are used to determine the strongly voiced
($hv_j$=1)/weakly voiced ($hv_j$=0) classification associated with the jth harmonic $\omega_j^n$. The
flowchart of Process IV is given in Figure 27. The $R^n$ array of 160 samples is Hamming
windowed and augmented to form a 512 size array, which is then FFT processed. The
maximum and minimum values $MGR_{max}$, $MGR_{min}$ of the resulting 256 spectral magnitude
values are determined, and a threshold TH0 is calculated. TH0 is then used to clip the
magnitude spectrum. The clipped MGR array is searched to define peaks MGR(P) satisfying:

MGR(P)>MGR(P+1) and MGR(P)>MGR(P-1)

For each peak, MGR(P), "supported" by the MGR(P+1) and MGR(P-1) values a second order
polynomial is fitted and the maximum point of this curve is accepted as MGR(P) with a
location loc(MGR(P)). Further constraints are then imposed on these magnitude peaks. In
particular peaks are rejected:
a) if there are spectral peaks in the neighbourhood of loc(MGR(P)) (i.e. in the range
(loc(MGR(P))-f0/2 to loc(MGR(P))+f0/2, where f0 is the fundamental frequency in Hz),
whose value is larger than 80% of MGR(P), or

b) if there are any spectral magnitudes in the same range whose value is larger than MGR(P).
After applying these two constraints the remaining spectral peaks are characterised as
"dominant" peaks. The objective of the remaining part of the process is to examine if there is
a "dominant" peak near a given harmonic $j \times \omega_0$, in which case the harmonic is classified as
strongly voiced and $hv_j$=1, otherwise $hv_j$=0. In particular, two thresholds are defined as
follows:

TH1=0.15×f0, TH2=(1.5/$P_n$)×f0
with f0=(1/$P_n$)×fs and fs is the sampling frequency.

The difference ($loc(MGR_d(k)) - loc(MGR_d(k-1))$) is compared to 1.5×f0+TH2, and if
larger a related harmonic is not associated with a "dominant" peak and the corresponding
classification hv is zero (weakly voiced). ($loc(MGR_d(k))$ is the location of the kth dominant
peak and k=1,...,D where D is the number of dominant peaks.) This procedure is described in
detail in Figure 28, in which it should be noted that the harmonic index j does not always
correspond to the magnitude spectrum peak index k, and loc(k) is the location of the kth
dominant peak, i.e. $loc(MGR_d(k))$ = loc(k).
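
The second order polynomial fit through a spectral peak and its two neighbours, used earlier in Process IV to refine MGR(P) and its location, corresponds to standard three-point parabolic interpolation. A minimal sketch follows, with locations expressed in FFT bins; the function name `refine_peak` is hypothetical and the formula is the textbook one, not taken from Figure 27.

```python
def refine_peak(mag, p):
    """Fit a parabola through mag[p-1], mag[p], mag[p+1] and return
    (location, value) of its maximum. Location is in fractional bins."""
    a, b, c = mag[p - 1], mag[p], mag[p + 1]
    denom = a - 2.0 * b + c
    if denom == 0.0:                  # flat top: keep the sampled peak
        return float(p), b
    offset = 0.5 * (a - c) / denom    # lies in (-0.5, 0.5) for a true peak
    value = b - 0.25 * (a - c) * offset
    return p + offset, value

# Samples of the parabola y = 10 - (x - 5.3)^2 taken at integer x.
mag = [10.0 - (x - 5.3) ** 2 for x in range(10)]
loc, val = refine_peak(mag, 5)
```

Because the test data are themselves samples of a parabola, the refined location and value recover the true maximum (5.3 and 10) up to rounding error.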

In order to minimise the bit rate associated with the transmission of the $hv_j$ information, two
schemes have been employed which coarsely represent hv.

Scheme I

The spectrum is divided into bands of 500Hz each and a strongly voiced/weakly voiced flag
Bhv is assigned for each band. The first and last 500Hz bands i.e. 0 to 500 and 3500 to
4000 Hz are always regarded as strongly voiced (Bhv=1) and weakly voiced (Bhv=0)
respectively. When $V_n$=1 and $V_{n-1}$=1 the 500 to 1000 Hz band is classified as voiced i.e.
Bhv=1. Furthermore, when $V_n$=1 and $V_{n-1}$=0 the 3000 to 3500 Hz band is classified as
weakly voiced i.e. Bhv=0. The Bhv values of the remaining 5 bands are determined using a
majority decision rule on the $hv_j$ values of the j harmonics which fall within the band under
consideration. When the number of harmonics for a given band is even and no clear majority
can be established i.e. the number of harmonics with $hv_j$=1 is equal to the number of
harmonics with $hv_j$=0, then the value of Bhv for that band is set to the opposite of the value
assigned to the immediately preceding band. At the decoding process the $hv_j$ of a specific
harmonic j is equal to the Bhv value of the corresponding band. Thus the hv information may
be transmitted with 5 bits.
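
The band decision rule of the 500 Hz scheme can be sketched as below. The sketch assumes harmonic frequencies are supplied in Hz, treats a band containing no harmonics like a tie (an assumption, since that case is not covered in the text), and uses the hypothetical name `band_voicing_scheme1`.

```python
def band_voicing_scheme1(harmonic_freqs, hv, v_n=1, v_prev=1,
                         band_hz=500, n_bands=8):
    """Return the list of Bhv flags, one per 500 Hz band."""
    bhv = [None] * n_bands
    bhv[0] = 1                        # 0 to 500 Hz: always strongly voiced
    bhv[-1] = 0                       # 3500 to 4000 Hz: always weakly voiced
    if v_n == 1 and v_prev == 1:
        bhv[1] = 1                    # 500 to 1000 Hz forced to voiced
    if v_n == 1 and v_prev == 0:
        bhv[-2] = 0                   # 3000 to 3500 Hz forced to weakly voiced
    for b in range(1, n_bands - 1):
        if bhv[b] is not None:
            continue
        votes = [h for f, h in zip(harmonic_freqs, hv)
                 if b * band_hz <= f < (b + 1) * band_hz]
        strong = sum(votes)
        weak = len(votes) - strong
        if strong > weak:
            bhv[b] = 1
        elif weak > strong:
            bhv[b] = 0
        else:                         # tie (or empty band, by assumption)
            bhv[b] = 1 - bhv[b - 1]
    return bhv

freqs = [200 * k for k in range(1, 20)]        # harmonics of a 200 Hz pitch
flags = [1 if f < 2000 else 0 for f in freqs]  # strongly voiced below 2 kHz
bands = band_voicing_scheme1(freqs, flags, v_n=1, v_prev=1)
```

In each of the two voicing cases exactly five bands are left to the majority rule, which is why five bits suffice.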

Scheme II 

In this case the 680 Hz to 3400 Hz range is represented by only two variable size bands.
When $V_n$=1 and $V_{n-1}$=0 the Fc frequency that separates these two bands can be one of the
following:

(A) 680, 1360, 2040, 2720.

whereas, when $V_n$=1 and $V_{n-1}$=1, Fc can be one of the following frequencies:

(B) 1360, 2040, 2720, 3400.

Furthermore, the 0 to 680 and 3400 to 4000 Hz bands are always represented with Bhv=1 and
Bhv=0 respectively. The Fc frequency is selected by examining the three bands sequentially
defined by the frequencies in (A) or (B) and by using again a majority rule on the harmonics
which fall within a band. When a band with a mixed voiced classification Bhv=0 is found, i.e.
the number of harmonics with $hv_j$=0 is larger than the number of harmonics with $hv_j$=1,
then Fc is set to the lower boundary of this band and the remaining spectral region is
classified as Bhv=0. In this case only 2 bits are allocated to define Fc. The lower band is
strongly voiced with Bhv=1, whereas the higher band is weakly voiced with Bhv=0.
To illustrate the effect of the mixed voice classification on the speech synthesised from the
transmitted information, Figures 29 and 30 represent respectively an original speech
waveform obtained for the utterance "Industrial shares were mostly a" and frequency tracks
obtained for that utterance. The horizontal axis represents time in terms of frames each of
20msec duration. Figure 31 shows to a larger scale a section of Figure 30, and represents
frequency tracks by full lines for the case when the voiced frames are all deemed to be
strongly voiced (hv=1) and by dashed lines when the strongly/weakly voiced classification is
taken into account so as to introduce random perturbations when hv=0.

Figure 32 shows four waveforms A, B, C and D. Waveform A represents the magnitude
spectrum of a speech segment and the corresponding LPC spectral envelope ($\log_{10}$ domain).
Waveforms B, C and D represent the normalised Short-Term magnitude spectrum of the
corresponding residual segment (B), the excitation segment obtained using the binary
(voiced/unvoiced) excitation model (C), and the excitation segment obtained using the
strongly voiced/weakly voiced/unvoiced hybrid excitation model (D). It will be noted that
the hybrid model introduces an appropriate amount of randomness where required in the 3π/4
to π range such that curve D is a much closer approximation to curve B than curve C.

Process V of Figure 1 will now be described. Once the residual signal has been derived, a
segment of $P_n$ samples is obtained in the residual signal domain. The magnitude spectrum of
the segment, which contains excitation source information, is derived by applying a $P_n$ points
DFT. An alternative solution, in order to avoid the computational complexity of the $P_n$ points
DFT, is to apply a fixed length FFT (128 points) and to find the value of the magnitude
spectrum at the desired points, using linear interpolation.



For a real-valued sequence x(i) of P points the DFT may be expressed as:

$$X(k) = \sum_{i=0}^{P-1} x(i)\cos\left(\frac{2\pi k i}{P}\right) - j\sum_{i=0}^{P-1} x(i)\sin\left(\frac{2\pi k i}{P}\right)$$

The $P_n$ point DFT will yield a double-side spectrum. Thus, in order to represent the excitation
signal as a superposition of sinusoidal signals, the magnitude of all the non DC components
must be multiplied by a factor of 2. The total number of single side magnitude spectrum
values, which are used in the reconstruction process, is equal to $\lceil (P_n+1)/2 \rceil$.
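
A direct, unoptimised Python sketch of this single side magnitude computation is given below. The division by P is an added normalisation (an assumption) chosen so that a sinusoid of amplitude A appears with magnitude A, and the function name `single_sided_magnitudes` is hypothetical. Following the text, every non DC component is doubled, which for even P also doubles the half sampling rate bin.

```python
import cmath
import math

def single_sided_magnitudes(x):
    """Magnitudes of the first ceil((P + 1) / 2) DFT bins of x,
    with every non-DC bin multiplied by 2."""
    P = len(x)
    n_vals = (P + 2) // 2             # integer form of ceil((P + 1) / 2)
    mags = []
    for k in range(n_vals):
        X = sum(x[i] * cmath.exp(-2j * math.pi * k * i / P) for i in range(P))
        scale = 1.0 if k == 0 else 2.0
        mags.append(scale * abs(X) / P)   # 1/P normalisation is an assumption
    return mags

# One pitch cycle of 16 samples holding the second harmonic at amplitude 3.
segment = [3.0 * math.cos(2 * math.pi * 2 * i / 16) for i in range(16)]
mg = single_sided_magnitudes(segment)
```

Here `mg[2]` recovers the amplitude of the second harmonic and all other entries are numerically zero.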

Process VI of Figure 1 will now be described. The DFT (Process V) applied on the $P_n$
samples of a pitch segment in the residual domain yields $\lceil (P_n+1)/2 \rceil$ spectral magnitudes
($MG_j^n$, $0 \le j < \lceil (P_n+1)/2 \rceil$) and $\lceil (P_n+1)/2 \rceil$ phase values. The phase information is
neglected. However, the continuity of the phase between adjacent voiced frames is preserved.
Moreover, the contribution of the DC magnitude component is assumed to be negligible and
thus, $MG_0^n$ is set to 0. In this way, the non-DC magnitude spectrum is assumed to contain all
the perceptually important information.

Based on the assumption of an "approximately" flat shape magnitude spectrum for the pitch
residual segment, various methods could be used to represent the entire magnitude spectrum
with a single value. Specifically, a modified single value spectral amplitude representation
(MSVSAR) technique is described below.

MSVSAR is based on the observation that some of the speech spectrum resonance and
anti-resonance information is also present at the residual magnitude spectrum (G.S. Kang and S.S.
Everett, "Improvement of the Excitation Source in the Narrow-Band Linear Prediction
Vocoder", IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-33, pp.377-386, 1985).
LPC inverse filtering can not produce a residual signal of absolutely flat magnitude spectrum
mainly due to: a) the "cascade representation" of formants by the LPC filter 1/A(z), which
results in the magnitudes of the resonant peaks being dependent upon the pole locations of the
1/A(z) all-pole filter and b) the LPC quantisation noise. As a consequence, the LPC residual
signal is itself highly intelligible. Based on this observation the $MG_j^n$ magnitudes are
obtained by spectral sampling at the harmonic locations, $\omega_j^n$, j=1,...,$\lfloor (P_n+1)/2 \rfloor$, of a
modified LPC synthesis filter, that is defined as follows:

$$MP_n(z) = \frac{G_N}{1 - G_R\sum_{i=1}^{p}\hat{a}_i^n z^{-i}} \qquad (32)$$

where $\hat{a}_i^n$, i=1,...,p represent the p quantised LPC coefficients of the nth coding frame and
$G_R$ and $G_N$ are defined as follows:

$$G_R = G_C\prod_{i=1}^{p}\left(1-(K_i^n)^2\right) \qquad (33)$$
and

$$G_N = \frac{\sqrt{\dfrac{1}{2P_n}\sum_{i=0}^{2P_n-1}\left(x^n(i)\right)^2}}{\sqrt{\sum_{j}\left(MP(\omega_j^n)\,H(\omega_j^n)\right)^2/2}} \qquad (34)$$

where $K_i^n$, i=1,...,p are the reflection coefficients of the nth coding frame, $x^n(i)$ represents a
sequence of 2$P_n$ speech samples centred in the middle of the nth coding frame from which the
mean value is being calculated and removed, $MP(\omega_j^n)$ and $H(\omega_j^n)$ represent the frequency
response of the MP(z) and 1/A(z) filters respectively at the $\omega_j^n$ frequency. Notice that the
$MP(\omega_j^n)$ values are calculated assuming $G_N$=1. The $G_C$ parameter represents a constant
whose value is set to 0.25.

Equation 32 defines a modified LPC synthesis filter with reduced feedback gain, whose
frequency response consists of nearly equalised resonant peaks, the locations of which are
very close to the LPC synthesis resonant locations. Furthermore, the value of the feedback
gain $G_R$ is controlled by the performance of the LPC model (i.e. it is proportional to the
normalised LPC prediction error). In addition Equation 34 ensures that the energy of the
reproduced speech signal is equal to the energy of the original speech waveform. Robustness
is increased by computing the speech RMS value over two pitch periods.
Two alternative magnitude spectrum representation techniques are described below, which
allow for better coding of the magnitude information and lead to a significant improvement in
reconstructed speech quality.

The first of the alternative magnitude spectrum representation techniques is referred to
below as the "Na amplitude system". The basic principle of this $MG_j^n$ quantisation system is
to represent accurately those $MG_j^n$ values which correspond to the Na largest speech Short
Term (ST) spectral envelope values. In particular, given the LPC coefficients of the nth
coding frame, the ST magnitude spectrum envelope is calculated (i.e. sampled) at the
harmonic frequencies $\omega_j^n$ and the locations lc(j), j=1,...,Na of the largest Na spectral samples
are determined. These locations indicate effectively which of the $\lceil (P_n+1)/2 \rceil - 1$ $MG_j^n$
magnitudes are subjectively more important for accurate quantization. The system
subsequently selects $MG_j^n$, j=lc(1),...,lc(Na) and Vector Quantizes these values. If the
minimum pitch value is 17 samples, the number of non-DC $MG_j^n$ amplitudes is equal to 8
and for this reason Na≤8. Two variations of the "Na-amplitudes system" were developed with
equivalent performance and their block diagrams are depicted in Figure 33 (a) and (b)
respectively.

i) Na-amplitudes system with Mean Normalization Factor. In this variation, a pitch segment
of $P_n$ residual samples $R^n(i)$, centered about the middle of the nth coding frame is
obtained and DFT transformed. The mean value of the spectral magnitudes $MG_j^n$, j=1,...,
$\lfloor (P_n+1)/2 \rfloor$ is calculated as:

$$m = \frac{1}{\lfloor (P_n+1)/2 \rfloor}\sum_{j=1}^{\lfloor (P_n+1)/2 \rfloor} MG_j^n \qquad (35)$$

m is quantized and then used as the normalization factor of the Na selected amplitudes $MG_j^n$,
j=lc(1),...,lc(Na). The resulting Na amplitudes are then vector quantized to $\widehat{MG}_j^n$.

ii) Na-amplitudes system with RMS Normalization Factor. In this variation the RMS value
of the pitch segment centered about the middle $M_n$ of the nth coding frame, is calculated as:

$$g = \sqrt{\frac{1}{P_n}\sum_{i=0}^{P_n-1}\left(R^n(i)\right)^2} \qquad (36)$$



g is quantized and then used as the normalization factor of the Na selected amplitudes $MG_j^n$,
j=lc(1),...,lc(Na). These normalized amplitudes are then Vector Quantised to $\widehat{MG}_j^n$. Notice
that the $P_n$ points DFT operation can be avoided in this case, since the magnitude spectrum of
the pitch segment is calculated only at the Na selected harmonic frequencies $\omega_j^n$,
j=lc(1),...,lc(Na).

In both cases the quantisation of the m and g factors, used to normalize the $MG_j^n$ values, is
performed using an adaptive μ-law quantiser with a non-linear characteristic as:

$$c(A) = A_{max}\,\frac{\log_e\left(1+\mu|A|/A_{max}\right)}{\log_e(1+\mu)}\,\mathrm{sgn}(A) \quad \text{with } \mu=255 \qquad (37)$$

This arrangement for the quantization of g or m extends the dynamic range of the coder to not
less than 25dBs.
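
The non-linear characteristic can be illustrated with the standard μ-law compression curve and its inverse, written here in Python. This sketch covers only the companding law with μ=255; the adaptation of the quantiser is not reproduced, and the function names are hypothetical.

```python
import math

MU = 255.0

def mu_law_compress(a, a_max):
    """Standard mu-law characteristic c(A) for |A| <= a_max."""
    return (a_max * math.log(1.0 + MU * abs(a) / a_max)
            / math.log(1.0 + MU) * math.copysign(1.0, a))

def mu_law_expand(c, a_max):
    """Inverse of mu_law_compress."""
    return (a_max / MU * ((1.0 + MU) ** (abs(c) / a_max) - 1.0)
            * math.copysign(1.0, c))

# Small inputs are stretched before uniform quantisation.
small = mu_law_compress(0.01, 1.0)
full = mu_law_compress(1.0, 1.0)
```

A uniform quantiser applied to the compressed value then gives finer resolution for small normalisation factors than for large ones, which is what extends the usable dynamic range.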

At the receiver end the decoder recovers the $MG_j^n$ magnitudes as $MG_j^n = \widehat{MG}_j^n \times A$,
j=lc(1),...,lc(Na). The remaining $\lceil (P_n+1)/2 \rceil - Na - 1$ $MG_j^n$ values are set to a constant
value A (where A is either "m" or "g"). The block diagram of the adaptive μ-law quantiser is
shown in Figure 34.
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The second of the alternative magnitude spectrum representation techniques is referred to 
below as the "Variable Size Spectral Vector Quantisation (VS/SVQ)" system. Coding
systems, which employ the general synthesis formula of Equation (1) to recover speech,
encounter the problem of coding a variable length, pitch dependent spectral amplitude vector
MG. The "Na-amplitudes" $MG_j^n$ quantisation schemes described in Figure 33 avoid this
problem by Vector Quantising the minimum expected number of spectral amplitudes and by
setting the rest of the $MG_j^n$ amplitudes to a fixed value. However, such a partially spectrally
flat excitation model has limitations in providing high recovered speech quality. Thus, in
order to improve the output speech quality, the shape of the entire {$MG_j^n$} magnitude
spectrum should be quantised. Various techniques have been proposed for coding {$MG_j^n$}.
Originally ADPCM has been used across the $MG_j^n$ values associated to a specific coding
frame. Also {$MG_j^n$} has been DCT transformed and coded differentially across successive
$MG_j$ magnitude spectra. However, these coding schemes are rather inefficient and operate
with relatively high bit rates. The introduction of Vector Quantisation on the {$MG_j^n$} spectral
amplitude vectors allowed for the development of Sinusoidal and Prototype Interpolation
systems which operate at around 2.4 Kbits/sec. Two known {$MG_j^n$} VQ methods are
described below which quantise a variable size ($vs_n$) input vector with a fixed size (fxs)
codevector.

i) The first VQ method involves the transformation of the input vector to a fixed size vector
followed by conventional Vector Quantisation. The inverse transformation on the quantised
fixed size vector yields the recovered quantised $MG^n$ vector. Transformation techniques
which have been used include Linear Interpolation, Band Limited Interpolation, All Pole
modelling and Non-Square transformation. However, the overall distortion produced by this
approach is the summation of the VQ noise and a component which is introduced by the
transformation process.
ii) The second VQ method achieves the direct quantisation of a variable input vector with a
fixed size code vector. This is based on selecting only $vs_n$ elements from each codebook
vector, to form a distortion measure between a codebook vector and an input $MG^n$ vector.
Such a quantisation approach avoids the transformation distortion of the previous techniques
mentioned in (i) and results in an overall distortion that is equal to the Vector Quantisation
noise.

An improved VQ method will now be described which is referred to below as the Variable
Size Spectral Vector Quantisation (VS/SVQ) scheme. This scheme was developed to take
advantage of the underlying principle that the actual shape of the {$MG_j^n$} magnitude
spectrum is defined by a minimum $\lceil (P_n+1)/2 \rceil$ of equally spaced samples. If we consider
the maximum expected pitch estimate $P_{max}$, then any {$MG_j^n$} spectral shape can be
represented adequately by $\lceil (P_{max}+1)/2 \rceil$ samples. This suggests that the fixed size fxs of the
codebook vectors $S^i$ representing the $MG_j^n$ shapes should not be larger than $\lceil (P_{max}+1)/2 \rceil$.
Of course this also implies that given the $\lceil (P_{max}+1)/2 \rceil$ samples of a codebook vector, the
complete spectral shape, defined at any frequency, is obtained via an interpolation process.

Figure 35 highlights the VS/SVQ process. The codebook CBS having cbs fixed fxs
dimension vectors $S_j^i$, j=1,...,fxs and i=1,...,cbs, where fxs is $\lceil (P_{max}+1)/2 \rceil$, is used to quantise
an input vector $MG_j^n$, j=1,...,$vs_n$ of dimension $vs_n$. Interpolation (in this case linear) is used
on the $S^i$ vectors to yield $\tilde{S}^i$ vectors of dimension $vs_n$. The $S^i$ to $\tilde{S}^i$ interpolation
process is given by Equation (38), for i=1,...,cbs and j=1,...,$vs_n$.

This process effectively defines $\tilde{S}^i$ spectral shapes at the $\omega_j^n$ frequencies of the $MG_j^n$
vector. A distortion measure D($\tilde{S}^i$, $MG^n$) is then defined between the $\tilde{S}^i$ and $MG^n$
vectors, and the codebook vector $S^I$ that yields the minimum distortion is selected and its
index I is transmitted. Of course in the receiver, Equation (38) is used to define $\widehat{MG}^n$ from
$S^I$.
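
The idea of matching a variable size input vector against fixed size codevectors through linear interpolation can be sketched as follows. This is an illustration under stated assumptions and not a transcription of Equation (38): the first and last codevector samples are mapped onto the first and last output points, the distortion is an unweighted squared error, and the names `resample_linear` and `vs_svq_search` are hypothetical.

```python
def resample_linear(codevector, vs_n):
    """Linearly interpolate a fixed size codevector to vs_n points."""
    fxs = len(codevector)
    if vs_n == 1:
        return [codevector[0]]
    out = []
    for j in range(vs_n):
        pos = j * (fxs - 1) / (vs_n - 1)   # assumed end-point alignment
        lo = int(pos)
        hi = min(lo + 1, fxs - 1)
        frac = pos - lo
        out.append((1.0 - frac) * codevector[lo] + frac * codevector[hi])
    return out

def vs_svq_search(codebook, mg):
    """Return (index, distortion) of the best codevector for input mg."""
    best_index, best_dist = -1, float("inf")
    for i, s in enumerate(codebook):
        s_tilde = resample_linear(s, len(mg))
        dist = sum((a - b) ** 2 for a, b in zip(s_tilde, mg))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

codebook = [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
index, dist = vs_svq_search(codebook, [0.0, 2.0, 4.0])
```

The decoder would apply the same interpolation to the selected codevector, so only the index needs to be transmitted.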

If we assume that $P_{max}$=120 then fxs=60. However this value can be reduced to 50 without
significant degradation by low pass filtering the signal synthesised from Equation (1). This is
achieved by setting to zero all the harmonics $MG_j^n$ in the region of 3.4 to 4.0 KHz, in which
case:
'3400x 



fs 

5{)=vSn otherwise. 



and vSn<fxs. 



ifvs <50 (39) 



Amplitude vectors, obtained from adjacent residual frames, exhibit significant redundancy,
which can be removed by means of backward prediction. Prediction is performed on a
harmonic basis i.e. the amplitude value of each harmonic $MG_j^n$ is predicted from the
amplitude value of the same harmonic in previous frames i.e. $MG_j^{n-1}$. A fixed linear predictor
$\widetilde{MG}^n = b \times \widehat{MG}^{n-1}$ may be incorporated in the VS/SVQ system, and the resulting DPCM
structure is shown in Figure 36 (differential VS/SVQ, (DVS/SVQ)). In particular, error
vectors are formed as the difference between the original spectral amplitudes $MG_j^n$ and their
predicted ones $\widetilde{MG}_j^n$, i.e.:
$E_j^n = MG_j^n - \widetilde{MG}_j^n$ for j=1,...,$vs_n$,

where the predicted spectral amplitudes $\widetilde{MG}_j^n$ are given as:

$$\widetilde{MG}_j^n = \begin{cases} b \times \widehat{MG}_j^{n-1} & \text{when } V_{n-1}=1 \\ 0 & \text{when } V_{n-1}=0 \end{cases} \qquad \text{for } 1 \le j \le vs_{n-1} \qquad (40)$$

and



wo 98/01848 



PCT/GB97/01S31 



60 



MG"j = 



1 




for vsn.,<j<vs„ 



(41) 



iM Jk-i 



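Furthermore the quantised spectral amplitudes $\widehat{MG}_j^n$ are given as:

$$\widehat{MG}_j^n = \widetilde{MG}_j^n + \hat{E}_j^n \qquad \text{for } 1 \le j \le vs_n \qquad (42)$$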



— Ia/g; 



2 



where $\hat{E}^n$ denotes the quantised error vector.

The quantisation of the $E_j^n$, $1 \le j \le vs_n$, error vector incorporates Mean Removal and Gain Shape
Quantisation techniques, using the hierarchical VQ structure of Figure 36.

A weighted Mean Square Error is used in the VS/SVQ stage of the system. The weighting function is defined as the frequency response of the filter $W(z) = 1/A_n(z/\gamma)$, where $A_n(z)$ is the short-term linear prediction filter and $\gamma$ is a constant, defined as $\gamma = 0.93$. Such a weighting function, which is proportional to the short-term envelope spectrum, results in substantially improved decoded speech quality. The weighting function $W_j^n$ is normalised so that:

(43)

The pdf of the mean value of $E^n$ is very broad and, as a result, the mean value differs widely from one vector to another. This mean value can be regarded as statistically independent of the variation of the shape of the error vector $E^n$ and thus can be quantised separately without paying a substantial penalty in compression efficiency. The mean value of an error vector is calculated as follows:

(44)

M is Optimum Scalar Quantised to $\hat{M}$ and is then removed from the original error vector to form $Erm_j^n = (E_j^n - \hat{M})$. The overall quantization distortion is attributed to the quantization of the "Mean Removed" error vectors ($Erm^n$), which is performed by a Gain-Shape Vector Quantiser.
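The weighting and mean removal steps above can be sketched as follows. Several details are assumptions made for this illustration because they are not legible in the text: the sign convention A(z) = 1 - sum a_i z^-i, sampling at the harmonic frequencies 2*pi*j/P, normalisation of the weights to unit sum, and the use of a plain arithmetic mean. The function names and the list of quantiser levels are hypothetical.

```python
import cmath
import math


def weights_at_harmonics(lpc, pitch_period, n_harmonics, gamma=0.93):
    """Sample 1/|A(z/gamma)| at harmonic frequencies, normalised to unit sum."""
    w = []
    for j in range(1, n_harmonics + 1):
        omega = 2.0 * math.pi * j / pitch_period
        # A(z/gamma) = 1 - sum_i a_i * gamma^i * z^-i (assumed sign convention)
        a = 1.0 - sum(c * gamma ** (i + 1) * cmath.exp(-1j * omega * (i + 1))
                      for i, c in enumerate(lpc))
        w.append(1.0 / abs(a))
    total = sum(w)
    return [x / total for x in w]


def remove_mean(err, levels):
    """Quantise the vector mean to the nearest level and subtract it."""
    mean = sum(err) / len(err)
    mean_q = min(levels, key=lambda lv: abs(lv - mean))
    return mean_q, [e - mean_q for e in err]
```

With a single positive predictor coefficient the weights fall off with frequency, which reflects the statement that the weighting follows the short-term spectral envelope.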

The objective of the Gain-Shape VQ process is to determine the gain value G and the shape vector S so as to minimise the distortion measure:

$$D(Erm^n, G \times S) = \sum_{j=1}^{vs_n} \left( Erm_j^n - G \times S_j \right)^2 \qquad (45)$$

A gain optimised VQ search method, similar to techniques used in CELP systems, is employed to find the optimum G and S. The shape codebook (CBS) of vectors is searched first to yield an index I, which maximises the quantity:

$$Q(i) = \ldots \quad \text{for } i = 1, \ldots, cbs \qquad (46)$$

where cbs is the number of codevectors in the CBS. The optimum gain value is defined as:

$$G = \ldots \qquad (47)$$

and is Optimum Scalar Quantised to $\hat{G}$.
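A gain optimised search of the kind used in CELP coders can be sketched as follows. The sketch assumes an unweighted squared error and shape vectors already interpolated to the length of the input. Under those assumptions, minimising the squared error over G for a fixed shape gives G = (Erm . S) / (S . S), and the best shape is the one that maximises (Erm . S)^2 / (S . S). These expressions are the standard least-squares result and are not copied from the text; the function name is hypothetical.

```python
def gain_shape_search(erm, shapes):
    """Return (index, gain) minimising sum_j (erm_j - G * s_j)^2.

    Assumes every shape vector has non-zero energy and the same
    length as erm.
    """
    best_i, best_q = 0, -1.0
    for i, s in enumerate(shapes):
        cross = sum(e * x for e, x in zip(erm, s))
        energy = sum(x * x for x in s)
        q = cross * cross / energy  # search criterion for shape i
        if q > best_q:
            best_i, best_q = i, q
    s = shapes[best_i]
    gain = sum(e * x for e, x in zip(erm, s)) / sum(x * x for x in s)
    return best_i, gain
```

Searching on the criterion first and computing the gain once afterwards avoids a joint search over every shape and gain pair.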

During shape quantisation the principles of VS/SVQ are employed, in the sense that the $S^i$, $vs_n$ size vectors are produced using Linear Interpolation on fxs size codevectors. Both trained and randomly generated shape CBS codebooks were investigated. Although $Erm^n$ has noise-like characteristics, systems using randomly generated shape codebooks resulted in unsatisfactory, muffled decoded speech and were inferior to systems employing trained shape codebooks.
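Producing a variable size vector from a fixed size codevector by linear interpolation can be sketched as follows. The mapping used here aligns the first and last elements of the two vectors and spaces the remaining points uniformly; that choice and the function name are assumptions for illustration, since the exact mapping is defined by Equation (38) of the description.

```python
def resize_linear(vec, new_len):
    """Linearly interpolate vec onto new_len uniformly spaced points."""
    if new_len == 1 or len(vec) == 1:
        return [vec[0]] * new_len
    out = []
    scale = (len(vec) - 1) / (new_len - 1)
    for j in range(new_len):
        pos = j * scale
        lo = min(int(pos), len(vec) - 2)  # keep lo + 1 inside the vector
        frac = pos - lo
        out.append(vec[lo] * (1.0 - frac) + vec[lo + 1] * frac)
    return out
```

In a variable size search, each fxs point codevector would be resized to the vs_n points of the current frame before the distortion is evaluated.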
A closed-loop joint predictor and VQ design process was employed to design the CBS
codebook, the optimum scalar quantisers CBM and CBG of the mean M and gain G values 
respectively, and also to define the prediction coefficient b of Figure 36. In particular, the 
following steps take place in the design process. 

STEP A0 (k=0). Given a training sequence of $MG_j^n$, the predictor $b^0$ is calculated in an open loop fashion (i.e. $\widetilde{MG}_j^n = b \times MG_j^{n-1}$ for $1 \le j \le [(P_{n-1} + 1)/2]$ when $V_{n-1} = 1$, or $\widetilde{MG}_j^n = 0$ elsewhere). Furthermore, the $CBM^0$ mean, $CBG^0$ gain and $CBS^0$ shape codebooks are designed independently and again in an open loop fashion using unquantized $E^n$. In particular:

a) Given a training sequence of error vectors $E^{n,0}$, the mean value of each $E^{n,0}$ is calculated and used in the training process of an Optimum Scalar Quantiser ($CBM^0$).

b) Given a training sequence of error vectors $E^{n,0}$ and the $CBM^0$ mean quantiser, the mean value of each error vector is calculated, quantised using the $CBM^0$ quantiser and removed from the original error vectors $E^{n,0}$ to yield a sequence of "Mean Removed" training vectors $Erm^{n,0}$.

c) Given a training sequence of $Erm^{n,0}$ vectors, each "Mean Removed" training vector is normalised to unit power (i.e. is divided by a gain factor G computed from the power of the vector), linearly interpolated to fxs points, and then used in the training process of a conventional Vector Quantiser of fxs dimension ($CBS^0$).

d) Given a training sequence of $Erm^{n,0}$ vectors and the $CBS^0$ shape codebook, each "Mean Removed" training vector is encoded using Equations (46) and (47) and the value G of Equation (47) is used in the training process of an Optimum Scalar Quantiser ($CBG^0$).

k is set to 1 (k=1).
STEP A1 Given a training sequence of $MG_j^n$ and the mean, gain and shape codebooks of the previous (k-1)th iteration (i.e. $CBM^{k-1}$, $CBG^{k-1}$, $CBS^{k-1}$), the optimum prediction coefficient $b^k$ is calculated.

STEP A2 Given a training sequence of $MG_j^n$, an optimum prediction coefficient $b^k$ and $CBM^{k-1}$, $CBG^{k-1}$, $CBS^{k-1}$, a training sequence of error vectors $E^{n,k}$ is formed, which is then used for the design of new mean, gain and shape codebooks (i.e. $CBM^k$, $CBG^k$, $CBS^k$).

STEP A3 The performance of the kth iteration quantization system (i.e. $b^k$, $CBM^k$, $CBG^k$, $CBS^k$) is evaluated and compared against the quantization system of the previous iteration (i.e. $b^{k-1}$, $CBM^{k-1}$, $CBG^{k-1}$, $CBS^{k-1}$). If the quantization distortion converges to a minimum, the quantization design process stops. Otherwise, k=k+1 and Steps A1, A2 and A3 are repeated.

The performance of each quantizer (i.e. $b^k$, $CBM^k$, $CBG^k$, $CBS^k$) has been evaluated using subjective tests and a LogSegSNR distortion measure, which was found to reflect the subjective performance of the system.

The design for the Mean-Shape-Gain Quantiser used in STEP A2 is performed using the 
following two steps:

STEP B1 Given a training sequence of error vectors $E^{n,k}$, the mean value of each $E^{n,k}$ is calculated and used in the training process of an Optimum Scalar Quantiser ($CBM^k$).

STEP B2 Given a training sequence of error vectors $E^{n,k}$ and the $CBM^k$ mean quantizer, the mean value of each residual vector is calculated, quantized and removed from the original residual vectors $E^{n,k}$ to yield a sequence of "Mean Removed" training vectors $Erm^{n,k}$, which are then used as the training data in the design of an optimum Gain Shape Quantizer ($CBG^k$ and $CBS^k$). This involves steps C1 - C4 below. (The quantization design process is performed under the assumption of an independent gain shape quantiser structure, i.e. an input error vector $Erm^n$ can be represented by any possible combination of $S^i$ codebook shape vectors and $\hat{G}$ gain quantizer levels.)

STEP C1 (v=0). Given a training sequence of vectors $Erm^{n,k}$ and initial $CBG^{k,0}$ and $CBS^{k,0}$ gain and shape codebooks respectively, compute the overall average distortion distance $D_{k,0}$ as in Equation (44). Set v equal to 1 (v=1).

STEP C2 Given a training sequence of vectors $Erm^{n,k}$ and the $CBG^{k,v-1}$ gain codebook from the previous iteration, compute the new shape codebook $CBS^{k,v}$ which minimises the VQ distortion measure. Notice that the optimum $CBS^{k,v}$ shape codebook is obtained when the distortion measure of Equation (44) is a minimum and this is achieved in $M1_{k,v}$ iterations.

STEP C3 Given a training sequence of vectors $Erm^{n,k}$ and the $CBS^{k,v}$ shape codebook, compute a new gain quantiser $CBG^{k,v}$, which minimises the distortion measure of Equation (44). This optimum $CBG^{k,v}$ gain quantiser is obtained when the distortion measure of Equation (44) is a minimum and this is achieved in $M2_{k,v}$ iterations.

STEP C4 Given a training sequence of vectors $Erm^{n,k}$ and the shape and gain codebooks $CBS^{k,v}$ and $CBG^{k,v}$, compute the average overall distortion measure $D_{k,v}$. If $(D_{k,v-1} - D_{k,v})/D_{k,v} < \epsilon$ stop. Otherwise, v=v+1 and go back to STEP C2.
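The alternating structure of steps C1 to C4 can be sketched as follows. This is a simplified illustration, not the design procedure of the text: it assumes fixed dimension training vectors (no interpolation), an unweighted squared error, a single centroid pass in each of the C2 and C3 analogues, and a hypothetical threshold `eps`. The centroid formulas used are the ordinary least-squares updates for an independent gain-shape structure.

```python
def encode(vec, shapes, gains):
    """Return (distortion, shape index, gain index) of the best pair."""
    best = None
    for i, s in enumerate(shapes):
        for gi, g in enumerate(gains):
            d = sum((v - g * x) ** 2 for v, x in zip(vec, s))
            if best is None or d < best[0]:
                best = (d, i, gi)
    return best


def train_gain_shape(train, shapes, gains, eps=1e-3, max_outer=50):
    shapes = [list(s) for s in shapes]
    gains = list(gains)
    # C1 analogue: distortion of the initial codebooks
    prev = sum(encode(v, shapes, gains)[0] for v in train) / len(train)
    history = [prev]
    for _ in range(max_outer):
        # C2 analogue: update shape centroids with the gains fixed
        assign = [encode(v, shapes, gains) for v in train]
        for i in range(len(shapes)):
            members = [(v, gains[g]) for v, (_, si, g) in zip(train, assign)
                       if si == i]
            den = sum(g * g for _, g in members)
            if den > 0.0:
                shapes[i] = [sum(g * v[u] for v, g in members) / den
                             for u in range(len(shapes[i]))]
        # C3 analogue: update gain levels with the shapes fixed
        assign = [encode(v, shapes, gains) for v in train]
        for gi in range(len(gains)):
            members = [(v, shapes[si]) for v, (_, si, g) in zip(train, assign)
                       if g == gi]
            den = sum(x * x for _, s in members for x in s)
            if den > 0.0:
                gains[gi] = sum(a * x for v, s in members
                                for a, x in zip(v, s)) / den
        # C4 analogue: relative improvement test
        dist = sum(encode(v, shapes, gains)[0] for v in train) / len(train)
        history.append(dist)
        if dist == 0.0 or (prev - dist) / dist < eps:
            break
        prev = dist
    return shapes, gains, history
```

Each update is a least-squares fit for a fixed assignment, and each re-encoding picks the best pair for fixed codebooks, so the recorded distortion does not increase from one pass to the next.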

The centroids $S_{i,u}^{k,v,m}$, i=1,...,cbs and u=1,...,fxs, of the shape codebook $CBS^{k,v,m}$ are updated during the mth iteration performed in STEP C2 (m=1,...,$M1_{k,v}$) as follows:

$$S_{i,u}^{k,v,m} = \ldots \qquad (48)$$

where $Q_i$ denotes the cluster of $Erm^{n,k}$ error vectors which are quantised to the $S_i^{k,v,m-1}$ codebook shape vector, cbs represents the total number of shape quantisation levels, $J_n$ represents the $CBG^{k,v-1}$ gain codebook index which encodes the $Erm^{n,k}$ error vector and $1 \le j \le vs_n$.

The gain centroids $G_i^{k,v,m}$, i=1,...,cbg, of the $CBG^{k,v,m}$ gain quantiser, which are computed during the mth iteration in STEP C3 (m=1,...,$M2_{k,v}$), are given as:

$$G_i^{k,v,m} = \ldots \qquad (49)$$

where $D_i$ denotes the cluster of $Erm^{n,k}$ error vectors which are quantised to the $G_i^{k,v,m-1}$ gain quantiser level, cbg represents the total number of gain quantisation levels, $I_n$ represents the $CBS^{k,v}$ shape codebook index which encodes the $Erm^{n,k}$ error vector and $1 \le j \le vs_n$.

The design process described above is applied to obtain the optimum shape codebook CBS, the optimum gain and mean quantizers CBG and CBM, and the optimum prediction coefficient b, which was finally set to b=0.35.

Process VII calculates the energy of the residual signal. The LPC analysis performed in Process II provides the prediction coefficients $a_i$, $1 \le i \le p$, and the reflection coefficients $k_i$, $1 \le i \le p$. On the other hand, the Voiced/Unvoiced classification performed in Process I provides the short term autocorrelation coefficient for zero delay of the speech signal (R0) for the frame under consideration. Hence, the energy of the residual signal $E_n$ is given as:

$$E_n = \frac{R0}{M} \prod_{i=1}^{p} \left( 1 - k_i^2 \right) \qquad (50)$$

The above expression represents the minimum prediction error as it is obtained from the Linear Prediction process. However, because of quantization distortion, the parameters of the LPC filter used in the coding-decoding process are slightly different from the ones that achieve minimum prediction error. Thus, Equation (50) gives a good approximation of the residual signal energy with low computational requirements. The accurate $E_n$ value can be given as:

(51)

The resulting $\sqrt{E_n}$ is then Scalar Quantised using an adaptive $\mu$-law quantiser arrangement similar to the one depicted in Figure 34. In the case where more than one $\sqrt{E_n}$ values are used in the system, i.e. the energy is calculated for a number of subframes, then $E_{n,\xi}$ is given by the general equation:

$$E_{n,\xi} = \frac{1}{M_s} \sum_{i=0}^{M_s - 1} r_n^2(i + \xi M_s), \quad 0 \le \xi < H \qquad (52)$$

Notice that when H=1, $M_s$=M and for H=4, $M_s$=M/4.
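The two ways of obtaining the residual energy described above can be sketched as follows. The first function applies the product over reflection coefficients to whatever zero-lag autocorrelation value it is given; any per-sample normalisation of R0 is assumed to have been applied by the caller. The second computes the mean squared residual per subframe directly. The function names and the assumption that the frame length is a multiple of H are illustrative.

```python
def residual_energy_estimate(r0, reflection):
    """Approximate residual energy: r0 times the product of (1 - k_i^2)."""
    e = r0
    for k in reflection:
        e *= (1.0 - k * k)
    return e


def subframe_energies(residual, h):
    """Mean squared residual for each of h equal-length subframes.

    Assumes len(residual) is a multiple of h.
    """
    m_s = len(residual) // h
    return [sum(x * x for x in residual[n * m_s:(n + 1) * m_s]) / m_s
            for n in range(h)]
```

With h = 1 the second function reduces to the mean squared value of the whole frame, matching the remark that a single subframe spans all M samples.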
CLAIMS

1. A speech synthesis system in which a speech signal is divided into a series
of frames, and each frame is converted into a coded signal including a 
voiced/unvoiced classification and a pitch estimate, wherein a low pass filtered 
speech segment centred about a reference sample is defined in each frame, a 
correlation value is calculated for each of a series of candidate pitch estimates as 
the maximum of multiple crosscorrelation values obtained from variable length 
speech segments centred about the reference sample, the correlation values are 
used to form a correlation function defining peaks, and the locations of the
peaks are determined and used to define a pitch estimate. 

2. A system according to claim 1, wherein the pitch estimate is defined using 
an iterative process. 

3. A system according to claim 1 or 2, wherein a single reference sample may
be used, centred with respect to the respective frame. 

4. A system according to claim 1 or 2, wherein multiple pitch estimates are
derived for each frame using different reference samples, the multiple pitch 
estimates being combined to define a combined pitch estimate for the frame. 




5. A system according to any preceding claim, wherein the pitch estimate is 
modified by reference to a voiced/unvoiced status and/or pitch estimates of 
adjacent frames to define a final pitch estimate. 

6. A system according to any preceding claim, wherein the correlation 
function is clipped using a threshold value, remaining peaks being rejected if 
they are adjacent to larger peaks.

7. A system according to claim 6, wherein peaks are selected which are
larger than either adjacent peak and peaks are rejected if they are smaller than a
following peak by more than a predetermined factor. 

8. A system according to any preceding claim, wherein the pitch estimation 
procedure is based on a least squares error algorithm. 

9. A system according to claim 8, wherein the pitch estimation algorithm 
defines the pitch value as a number whose multiples best fit the correlation
function peak locations. 



10. A system according to any preceding claim, wherein possible pitch values 
are limited to integral numbers which are not consecutive, the increment 
between two successive numbers being proportional to a constant multiplied by
the lower of those two numbers. 

11. A speech synthesis system in which a speech signal is divided into a series 
of frames, and each frame is converted into a coded signal including pitch 
segment magnitude spectral information, a voiced/unvoiced classification, and a 
mixed voiced classification which classifies harmonics in the magnitude spectrum 
of voiced frames as strongly voiced or weakly voiced, wherein a series of samples
centred on the middle of the frame are windowed to form a data array which is
Fourier transformed to produce a magnitude spectrum, a threshold value is 
calculated and used to clip the magnitude spectrum, the clipped data is searched 
to define peaks, the locations of peaks are determined, constraints are applied to 
define dominant peaks, and harmonics not associated with a dominant peak are 
classified as weakly voiced. 

12. A system according to claim 11, wherein peaks are located using a second
order polynomial.

13. A system according to claim 11 or 12, wherein the samples are Hamming 
windowed. 
14. A system according to claim 11, 12 or 13, wherein the threshold value is
calculated by identifying the maximum and minimum magnitude spectrum 
values and defining the threshold as a constant multiplied by the difference 
between the maximum and minimum values. 

15. A system according to any one of claims 11 to 14, wherein peaks are 
defined as those values which are greater than the two adjacent values, a peak 
being rejected from consideration if neighbouring peaks are of a similar
magnitude or if there are spectral magnitudes in the same range of greater
magnitude. 

16. A system according to any one of claims 11 to 15, wherein a harmonic is
considered as not being associated with a dominant peak if the difference 
between two adjacent peaks is greater than a predetermined threshold value.

17. A system according to any one of claims 11 to 16, wherein the spectrum is 
divided into bands of fixed width and a strongly/weakly voiced classification is 
assigned for each band. 

18. A system according to any one of claims 11 to 17, wherein the frequency 
range is divided into two or more bands of variable width, adjacent bands being 
separated at a frequency selected by reference to the strongly/weakly voiced
classification of harmonics. 

19. A system according to claim 17 or 18, wherein the lowest frequency band 
is regarded as strongly voiced, whereas the highest frequency band is regarded
as weakly voiced. 

20. A system according to claim 19, wherein, in the event that a current frame is
voiced, and the following frame is unvoiced, further bands within the current 
frame will be automatically classified as weakly voiced. 

21. A system according to claim 19 or 20, wherein the strongly/weakly voiced 
classification is determined using a majority decision rule on the strongly/weakly 
voiced classification of those harmonics which fall within the band in question.

22. A system according to claim 21, wherein, if there is no majority, alternate 
bands are alternately assigned strongly voiced and weakly voiced classifications. 

23. A speech synthesis system in which a speech signal is divided into a series 
of frames, each frame is defined as voiced or unvoiced, each frame is converted 
into a coded signal including a pitch period value, a frame voiced/unvoiced 
classification and, for each voiced frame, a mixed voiced spectral band 
classification which classifies harmonics within spectral bands as either strongly
or weakly voiced, and the speech signal is reconstructed by generating an 
excitation signal in respect of each frame and applying the excitation signal to a 
filter, wherein for each weakly voiced spectral band, an excitation signal is 
generated which includes a random component in the form of a function which is 
dependent upon the respective pitch period value. 

24. A system according to claim 23, wherein the spectrum is divided into 
bands and a strongly/weakly voiced classification is assigned to each band. 

25. A system according to claim 23 or 24, wherein the random component is
introduced by reducing the amplitude of harmonic oscillators assigned the
weakly voiced classification, disturbing the oscillator frequencies such that the
frequency is no longer a multiple of the fundamental frequency, and then adding 
further random signals. 

26. A system according to claim 25, wherein the phase of the oscillators is 
randomised. 

27. A speech synthesis system in which a speech signal is divided into a series 
of frames, and each voiced frame is converted into a coded signal including a
pitch period value, LPC coefficients and pitch segment spectral magnitude
information, wherein the spectral magnitude information is quantized by
sampling the LPC short term magnitude spectrum at harmonic frequencies, the 
locations of the largest spectral samples are determined to identify which of the 
magnitudes are relatively more important for accurate quantization, and the 
magnitudes so identified are selected and vector quantized. 

28. A system according to claim 27, wherein a pitch segment of $P_n$ LPC
residual samples is obtained, where $P_n$ is the pitch period value of the nth frame,
the pitch segment is DFT transformed, the mean value of the resultant spectral 
magnitudes is calculated, the mean value is quantized and used as a 
normalisation factor for the selected magnitudes, and the resulting normalised
amplitudes are quantized. 

29. A system according to claim 27, wherein the RMS value of the pitch
segment is calculated, the RMS value is quantized and used as a normalisation 
factor for the selected magnitudes, and the resulting normalised amplitudes are 
quantized. 

30. A system according to any one of claims 27 to 29, wherein, at the receiver,
the selected magnitudes are recovered, and each of the other magnitude values is
reproduced as a constant value. 
31. A speech synthesis system in which a variable size input vector of
coefficients to be transmitted to a receiver for the reconstruction of a speech 
signal is vector quantized using a codebook defined by vectors of fixed size, the 
codebook vectors of fixed size are obtained from variable sized training vectors
and an interpolation technique which is an integral part of the codebook
generation process, codebook vectors are compared to the variable sized input
vector using the interpolation process, and an index associated with the codebook 
entry with the smallest difference from the comparison is transmitted, the index 
being used to address a further codebook at the receiver and thereby derive an 
associated fixed size codebook vector, and the interpolation process being used to 
recover from the derived fixed sized codebook vector an approximation of the
variable sized input vector. 

32. A system according to claim 31, wherein the interpolation process is 
linear, and for an input vector of given dimension, the interpolation process is 
applied to produce from the codebook vectors a set of vectors of that given 
dimension, a distortion measure is then derived to compare the interpolated set 
of vectors and the input vector, and the codebook vector is selected which yields 
the minimum distortion. 
33. A system according to claim 32, wherein the dimension of the vectors is
reduced by taking into account only the harmonic amplitudes within an input
bandwidth range. 



34. A system according to claim 33, wherein the remaining amplitudes are set 
to a constant value. 

35. A system according to claim 34, wherein the constant value is equal to the 
mean value of the quantized amplitudes. 

36. A system according to any one of claims 31 to 35, wherein redundancy 
between amplitude vectors obtained from adjacent residual frames is removed 
by means of backward prediction.

37. A system according to claim 36, wherein the backward prediction is 
performed on a harmonic basis such that the amplitude value of each harmonic 
of one frame is predicted from the amplitude value of the same harmonic in the 
previous frame or frames. 

38. A speech synthesis system in which a speech signal is divided into a series 
of frames, each frame is converted into a coded signal including an estimated 
pitch period, an estimate of the energy of a speech segment the duration of which 
is a function of the estimated pitch period, and LPC filter coefficients defining an
LPC spectral envelope, and a speech signal of related power to the power of the 
input speech signal is reconstructed by generating an excitation signal using 
spectral amplitudes which are defined from a modified LPC spectral envelope
sampled at harmonic frequencies defined by the pitch period. 

39. A system according to claim 38, wherein the magnitude values are
obtained by spectrally sampling a modified LPC synthesis filter characteristic at 
the harmonic locations related to the pitch period. 

40. A system according to claim 39, wherein the modified LPC synthesis filter
has reduced feed back gain and a frequency response which consists of equalised 
resonant peaks, the locations of which are close to the LPC synthesis resonant 
locations. 

41. A system according to claim 40, wherein the value of the feed back gain is
controlled by the performance of the LPC model such that it is related to the 
normalised LPC prediction error. 

42. A system according to any one of claims 38 to 41, wherein the energy of 
the reproduced speech signal is equal to the energy of the original speech 
waveform. 
43. A speech synthesis system in which a speech signal is divided into a series
of frames, each frame is converted into a coded signal including LPC filter 
coefficients and at least one parameter associated with a pitch segment 
magnitude, and the speech signal is reconstructed by generating two excitation 
signals in respect of each frame, each pair of excitation signals comprising a first 
excitation signal generated on the basis of the pitch segment magnitude 
parameter or parameters of one frame and a second excitation signal generated 
on the basis of the pitch segment magnitude parameter or parameters of a 
second frame which follows and is adjacent to the said one frame, applying the 
first excitation signal to a first LPC filter the characteristics of which are 
determined by the LPC filter coefficients of the said one frame and applying the 
second excitation signal to a second LPC filter the characteristics of which are 
determined by the LPC filter coefficients of the said second frame, and weighting 
and combining the outputs of the first and second LPC filters to produce one 
frame of a synthesised speech signal. 

44. A system according to claim 43, wherein the first and second excitation
signals include the same phase function and different phase contributions from 
the two LPC filters. 
45. A system according to claim 44, wherein the outputs of the first and
second LPC filters are weighted by half a window function such that the
magnitude of the output of the first filter is decreasing with time and the 
magnitude of the output of the second filter is increasing with time. 

46. A speech coding system which operates on a frame by frame basis, and in
which information is transmitted which represents each frame as either voiced or
unvoiced and, for each voiced frame, represents that frame by a pitch period 
value, quantized magnitude spectral information, and LPC filter coefficients, the 
received pitch period value and magnitude spectral information being used to 
generate residual signals at the receiver which are applied to LPC speech
synthesis filters the characteristics of which are determined by the transmitted 
filter coefficients, wherein each residual signal is synthesised according to a
sinusoidal mixed excitation synthesis process, and a recovered speech signal is
derived from the residual signals. 

47. A speech synthesis system substantially as hereinbefore described with 
reference to the accompanying drawings.